Instruments that use ordinal response data (e.g., Likert scale data, or partial credit scoring) are widely used in educational research studies. When using ordinal response data, research practitioners often need to handle potential response sparseness. Response sparseness occurs when a response option (or score) is rarely, if ever, endorsed (or earned). The common practice for addressing response sparseness is to combine adjacent response categories to either reduce the number of response options or dichotomize a polytomous response item to obtain a sufficient number of responses per category (e.g., Outpatient and Ambulatory Surgery CAHPS, 2017), to improve data-model fit (e.g., National Center for Education Statistics, 2008), or to reshape the data to meet specific modeling needs (e.g., Harpe, 2015). Collapsing response categories to create a dichotomous response item from what was originally a polytomous response item can introduce significant data-model misfit (Jansen & Roskam, 1986) along with a loss of power, reduced scale reliability, and spurious statistical significance when determining measurement invariance (Altman & Royston, 2006; MacCallum et al., 2002; Rutkowski et al., 2019). When collapsing categories to improve polytomous data-model fit, observed improved model fit may be spurious and only applicable to the collected sample (Rutkowski et al., 2019).
Category collapse is also commonly used in item response theory (IRT) analyses of ordinal response data to address disordered thresholds (e.g., Kim et al., 2010; Linden et al., 2020; Matovu, 2019; Smith et al., 2003). When modeling ordinal response data with an adjacent-categories IRT model, threshold order indicates how well an instrument functions, and disordered thresholds may indicate sparse response data in one or more of the response categories (Wind, 2023). Threshold order depends on the number of participants responding to an item, and collapsing the central response option to address threshold ordering (e.g., Rost & Von Davier, 1995) can complicate the interpretation of the original scale by mixing respondents whose central response reflects their latent trait with other respondents (Wetzel & Carstensen, 2014). Recent research suggests that when using a Partial Credit IRT model (PCM), disordered thresholds should be addressed by collapsing the response options closest to the disordered threshold category; when the central response option was affected, symmetrically collapsing response categories, that is, collapsing from both ends of the response continuum simultaneously, best addressed the disordered thresholds (Tsai et al., 2024).
The act of collapsing categories changes the observed item response matrix and, in turn, manipulates the conditional probabilities of a respondent selecting a response option or earning a particular score (Tsai et al., 2024). This violates the joining assumption (Jansen & Roskam, 1986) needed to fit adjacent-categories logistic models, like the Partial Credit Model (PCM). Previously, Harel (2014) and subsequently Harel and Steele (2018) provided an in-depth discussion of this topic when fitting a PCM.
Harel (2014) demonstrated, through mathematical derivation, that fitting a PCM to data with collapsed categories would result in deliberate model misspecification. Harel proposed three category collapse rules based on: (1) the item response function, (2) maximum information, and (3) integrated information. Simulation studies were performed to test the proposed rules, examine the effect of category collapse on person parameter recovery, and study how category collapse influenced threshold parameter recovery. However, none of the proposed rules could be used to make a consistent decision on when category collapse is appropriate. Harel and Steele (2018) extended this line of research by proposing an information matrix test (IMT) to assess the degree of data-model misspecification introduced by collapsing the central response category. They concluded that the direction in which the central category was collapsed significantly influenced PCM data-model fit, and that this misspecification was detectable using the proposed information matrix test.
Although Harel and many other researchers have concluded that utilizing collapsed categories negatively impacts statistical analyses, there exists a significant amount of research utilizing data with collapsed categories. MacCallum et al. (2002) found that 11.50% of articles published between 1998 and 2000 in the Journal of Personality and Social Psychology, Journal of Consulting and Clinical Psychology, and Journal of Counseling Psychology contained at least one instance of dichotomization through category collapse. Additionally, only 20% of these studies provided justification for dichotomizing responses. This finding illustrates the need for a comprehensive guide on when category collapse is justified and proper guidelines for implementing category collapse.
There are two main open questions concerning category collapse that this paper aims to address. Firstly, despite the significant research performed on collapsing response categories to address threshold order and data-model fit, there is no clear conclusion on whether category collapse introduces bias into polytomous IRT model item parameter estimates. If biased item parameter estimates are obtained, it is also unclear to what extent the ability estimates may be impacted. Therefore, the question remains: by collapsing response categories to address threshold order and/or data model fit, are we inadvertently biasing item and person parameter estimates?
Secondly, the majority of previous research concerning category collapse has focused on the Partial Credit IRT model. In practice, the two primary polytomous IRT models used to model education data are the Generalized Partial Credit Model (GPCM; Muraki, 1992) and the Graded Response Model (GRM; Samejima, 1969). However, to date, there has not been research focused on examining and comparing the effect category collapse has on these data-generating models. Therefore, the question remains whether these results also hold for GRM and GPCM data.
Thus, the purpose of this study is to extend the literature on category collapse by investigating, through rigorous Monte Carlo simulations, whether category collapse can be justified when utilizing the Graded Response and Generalized Partial Credit Item Response Theory models. We extend the literature by examining whether sample size, the direction of collapse, and the number of responses within a collapsed category influence Graded Response Model and Generalized Partial Credit Model parameter estimation and data-model fit. By performing this in-depth investigation, we can develop recommendations for educational research practitioners who want to collapse adjacent response options while introducing the least amount of bias into their results.
We begin by presenting relevant IRT methodology along with discussing statistical methodology used for item parameter estimation. The second section presents our simulation study and results. Lastly, we present an empirical example of category collapse and discuss how our simulation study results are supported by this study. We conclude with recommendations for research practitioners based on our study.
Methodology
Item Response Theory Methodology
Two of the most popular unidimensional polytomous Item Response Theory models used to analyze ordinal response data are the Graded Response and Generalized Partial Credit models. While these two models can be used to analyze ordinal response data, education research practitioners should be aware of the fundamental differences between these two models.
The Graded Response Model
The Graded Response model is used to model the probability of a respondent endorsing a particular response. For example, consider an instrument composed of items that have Likert scale responses coded as $k = 0, 1, \ldots, m_i$, where zero is a low endorsement of the latent trait and $m_i$ is high endorsement. Then the probability of an examinee with ability level $\theta_j$ endorsing Category $k$ on Item $i$ is defined as:

$$P(X_i = k \mid \theta_j) = P^{*}_{ik}(\theta_j) - P^{*}_{i,k+1}(\theta_j) \tag{1}$$

where

$$P^{*}_{ik}(\theta_j) = \frac{\exp\left[a_i(\theta_j - b_{ik})\right]}{1 + \exp\left[a_i(\theta_j - b_{ik})\right]}$$

with item-specific discrimination parameter $a_i$ and $m_i$ ordered threshold parameters $b_{i1} < b_{i2} < \cdots < b_{im_i}$. Furthermore, we define $P^{*}_{i0}(\theta_j) = 1$ and $P^{*}_{i,m_i+1}(\theta_j) = 0$.
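To make the cumulative-boundary construction in Equation 1 concrete, it can be sketched in a few lines of Python (the study itself used the mirt package in R; the parameter values below are hypothetical):

```python
import numpy as np

def grm_category_probs(theta, a, b):
    """Category probabilities for one GRM item (Equation 1).

    theta : examinee ability
    a     : item discrimination a_i
    b     : ordered thresholds b_1 < ... < b_m
    Returns P(X = k | theta) for k = 0..m.
    """
    b = np.asarray(b, dtype=float)
    # Boundary probabilities P*_k = P(X >= k | theta), k = 1..m
    p_star = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    # Pad with P*_0 = 1 and P*_{m+1} = 0, then take adjacent differences.
    p_star = np.concatenate(([1.0], p_star, [0.0]))
    return p_star[:-1] - p_star[1:]

probs = grm_category_probs(theta=0.0, a=1.2, b=[-1.5, -0.5, 0.5, 1.5])
print(probs.round(3))  # five category probabilities summing to 1
```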
The Generalized Partial Credit Model
In contrast, the Generalized Partial Credit model is used to model the probability of an examinee earning a score of $k$ out of a possible score of $m_i$. Respondents who receive maximum credit on an item earn a score of $m_i$, and this score decreases to 0 as respondents make mistakes. A fundamental assumption of the Generalized Partial Credit model is that potential item scores are ordered such that a respondent who earns a score of $k$ has correctly satisfied the requirements to earn a score of $k - 1$. But the level of difficulty moving from a score of $k - 1$ to $k$ may not necessarily be higher than the level of difficulty stepping up from $k - 2$ to $k - 1$; hence, the threshold parameters need not be strictly ordered. In contrast, in the GRM, since it is a “difference model,” all threshold parameters must be strictly ordered. Then, the probability of an examinee with ability level $\theta_j$ earning the score $k$ on Item $i$ is:

$$P(X_i = k \mid \theta_j) = \frac{\exp\left[\sum_{v=0}^{k} a_i(\theta_j - b_{iv})\right]}{\sum_{c=0}^{m_i} \exp\left[\sum_{v=0}^{c} a_i(\theta_j - b_{iv})\right]} \tag{2}$$

with item-specific discrimination parameter $a_i$ and threshold parameters $b_{i1}, \ldots, b_{im_i}$. Note that the index $v$ is not associated with IRT parameters and is only used to index the summations. We also define $\sum_{v=0}^{0} a_i(\theta_j - b_{iv}) \equiv 0$.
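A parallel sketch for Equation 2, again with hypothetical parameter values, shows the divide-by-total structure of the GPCM:

```python
import numpy as np

def gpcm_category_probs(theta, a, b):
    """Score probabilities for one GPCM item (Equation 2).

    theta : examinee ability
    a     : item discrimination a_i
    b     : thresholds b_1..b_m (need not be ordered)
    Returns P(X = k | theta) for k = 0..m.
    """
    b = np.asarray(b, dtype=float)
    # Numerator exponents: cumulative sums of a*(theta - b_v), with the
    # empty sum for k = 0 defined as zero.
    z = np.concatenate(([0.0], np.cumsum(a * (theta - b))))
    ez = np.exp(z - z.max())  # subtract the max for numerical stability
    return ez / ez.sum()      # divide-by-total normalization

probs = gpcm_category_probs(theta=0.0, a=1.0, b=[-1.0, 0.0, 1.0])
print(probs.round(3))  # four score probabilities summing to 1
```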
The Influence of Category Collapse on GRM and GPCM Parameter Estimation
Comparing Equations 1 and 2 we can see that, while both the GRM and GPCM are used to model polytomous item response data, they have significantly different functional forms. The GRM belongs to the “difference” model family while the GPCM belongs to the “divide-by-total” family of IRT models (García-Pérez, 2017; Thissen & Steinberg, 1986). This suggests that category collapse may influence each model differently. In this section, we highlight how threshold parameter estimation might be influenced by category collapse.
In Appendix A we derive the complete-data log-likelihood function and partial derivatives for the GRM and GPCM, respectively. From this derivation we can conclude that if a GRM is fit to the observed item responses, the direction of collapse does not influence threshold parameter estimation. Collapsing the highest two response categories involves removing the threshold parameter corresponding to the highest response option. Similarly, when collapsing a category down into a lower response category, the threshold parameter for the higher response option is removed. After removing the collapsed threshold, a GRM can still be used to estimate the remaining threshold parameters. This derivation is theoretical, and it alone will not provide a practical guide as to when collapse is needed. Instead, simulation studies must be used to determine thresholds for sparseness based on empirical evidence.
In contrast to the GRM, using collapsed categories with the GPCM changes the underlying threshold parameter estimation equation. A threshold parameter in the GRM models the probability of responding in Category $k$ or above. This link between non-adjacent categories allows for removal of a threshold parameter without changing the underlying IRT model. In contrast, threshold parameters in the GPCM relate to two adjacent response categories, with no connection to other response categories. Therefore, collapsing adjacent response categories by removing a threshold parameter influences the estimation of all other response categories, which changes the underlying IRT model. Prior research has reached similar conclusions (e.g., Harel, 2014; Harel & Steele, 2018). However, these prior studies did not examine whether the data-model misfit significantly biases parameter estimation. The degree to which parameter estimates are influenced can only be studied through simulation.
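The structural difference described above can be checked numerically. In the sketch below (hypothetical parameter values, using probability functions matching Equations 1 and 2), merging the top two categories of a GRM item exactly reproduces a GRM with the top threshold removed, while the same merge under the GPCM is not reproduced by a GPCM with one threshold dropped:

```python
import numpy as np

def grm_probs(theta, a, b):
    # GRM category probabilities via boundary differences (Equation 1).
    p = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(b, float))))
    p = np.concatenate(([1.0], p, [0.0]))
    return p[:-1] - p[1:]

def gpcm_probs(theta, a, b):
    # GPCM category probabilities via divide-by-total (Equation 2).
    z = np.concatenate(([0.0], np.cumsum(a * (theta - np.asarray(b, float)))))
    ez = np.exp(z - z.max())
    return ez / ez.sum()

a, b = 1.2, [-1.0, -0.3, 0.4, 1.1]   # hypothetical 5-category item
theta = 0.25

# GRM: merging the top two categories reproduces a GRM with b_4 removed.
merged_grm = grm_probs(theta, a, b)
merged_grm = np.concatenate((merged_grm[:3], [merged_grm[3] + merged_grm[4]]))
reduced_grm = grm_probs(theta, a, b[:3])
print(np.allclose(merged_grm, reduced_grm))    # True

# GPCM: the same merge is NOT reproduced by a GPCM with b_4 removed.
merged_gpcm = gpcm_probs(theta, a, b)
merged_gpcm = np.concatenate((merged_gpcm[:3], [merged_gpcm[3] + merged_gpcm[4]]))
reduced_gpcm = gpcm_probs(theta, a, b[:3])
print(np.allclose(merged_gpcm, reduced_gpcm))  # False
```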
Testing Data-Model Fit
Here, we are interested in using the generalized $S\text{-}X^2$ test for polytomous item response data (Kang & Chen, 2008) to validate the conclusions presented in prior category collapse research, namely that: (1) analyzing collapsed GPCM data by fitting a GPCM results in a significant data-model misfit percentage, and (2) analyzing collapsed GRM data by fitting a GRM results in a nominal data-model misfit percentage. If Conclusion (1) holds, we would observe a large percentage of GPCM simulations using collapsed data flagged for data-model misfit compared to baseline data-model fit percentages. If Conclusion (2) holds, we would observe a similar percentage of data-model misfit for all GRM simulations. A brief outline of the test is provided below.
The generalized $S\text{-}X^2$ test is used to determine how well a proposed IRT model matches the observed item responses. The test statistic follows an asymptotic chi-square distribution, and its corresponding p-value can be interpreted similarly to traditional chi-square goodness-of-fit tests. The test statistic for Item $i$ is calculated as:

$$S\text{-}X_i^2 = \sum_{k=1}^{T-1} \sum_{j=0}^{m_i} N_k \frac{\left(O_{ikj} - E_{ikj}\right)^2}{E_{ikj}} \tag{3}$$

where $j$ is the item response category/score, $m_i$ is the highest possible score/response category of Item $i$, $T$ is a perfect test score, and $N_k$ is the number of examinees in Group $k$. Note that the outer summation starts at $k = 1$ and ends at $T - 1$ because groups with extreme test scores (e.g., all correct or all incorrect) will have an expected proportion of zero. The conditional expected and observed category proportions, $E_{ikj}$ and $O_{ikj}$ respectively, along with the degrees of freedom, are calculated using methodology outlined in Kang and Chen (2008).
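As an illustration of how the statistic in Equation 3 aggregates over score groups and categories, the following sketch computes the sum for a toy table of observed and expected proportions. The values here are invented; in a real application the expected proportions come from the fitted IRT model, as described in Kang and Chen (2008):

```python
import numpy as np

def generalized_sx2(O, E, N):
    """Illustrative generalized S-X^2 sum for one item.

    O : observed category proportions, shape (score groups, categories)
    E : model-expected category proportions, same shape
    N : number of examinees in each score group, shape (score groups,)
    Groups with extreme total scores (expected proportions of zero) are
    assumed to have been excluded already.
    """
    O, E, N = np.asarray(O, float), np.asarray(E, float), np.asarray(N, float)
    return float(np.sum(N[:, None] * (O - E) ** 2 / E))

# Toy example: two score groups, three response categories.
O = [[0.20, 0.50, 0.30], [0.10, 0.40, 0.50]]
E = [[0.25, 0.50, 0.25], [0.10, 0.45, 0.45]]
N = [40, 60]
print(round(generalized_sx2(O, E, N), 3))  # 1.467
```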
Simulation Study Methodology
For the simulation study, we began by generating two datasets, one using a GRM data-generating model and the other using a GPCM data-generating model. Each dataset contained 12 items with five response categories (0, 1, 2, 3, 4). We induced sparseness in response Category 3 by manipulating the data-generating threshold parameters until we achieved an endorsement rate of approximately 5% for Items 7–9 and approximately 2.5% for Items 10–12. No other response categories, aside from Category 3, contained sparse endorsement rates. Data were generated using the mirt package (Chalmers, 2012) within R Statistical Software v4.3.2 (R Core Team, 2023). Data-generating parameters are provided in Appendix B.
Two endorsement thresholds were used to determine when category collapse is appropriate: (1) when a response category is endorsed by 5% or less of respondents, and (2) when a response category is endorsed by 2.5% or less of respondents. Using the 2.5% endorsement threshold, Items 10–12 would have Category 3 collapsed into an adjacent category; using the 5% endorsement threshold, Items 7–12 would have Category 3 collapsed into an adjacent category.
This collapse process resulted in six unique datasets for each data-generating process, as outlined in Table 1 below: (1a & 1b) the Baseline datasets, which are the original data without any collapsed categories; (2a & 2b) the Collapsed-Up datasets, where responses in Category 3 were recoded into Category 4; and (3a & 3b) the Collapsed-Down datasets, where responses in Category 3 were recoded into Category 2. Datasets 2a and 3a result from using a 5% endorsement threshold, and Datasets 2b and 3b from a 2.5% endorsement threshold. Each of these datasets was generated using each of six sample sizes: 150, 250, 500, 1000, 1500, and 2000.
Table 1
Data Generation Process
| Sparseness Condition | 2.5% | 2.5% | 2.5% | 5% | 5% | 5% |
|---|---|---|---|---|---|---|
| Collapse Direction | Baseline | Collapsed Up | Collapsed Down | Baseline | Collapsed Up | Collapsed Down |
| Dataset Label | 1a | 2a | 3a | 1b | 2b | 3b |

Note. Each dataset was generated at each of six sample sizes: 150, 250, 500, 1000, 1500, and 2000.
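The recoding behind the Collapsed-Up and Collapsed-Down datasets is a simple mapping of the target category into an adjacent one. A minimal sketch (the actual study performed this recoding in R):

```python
def collapse_category(responses, target, direction):
    """Recode a sparse response category into an adjacent one.

    responses : list of integer item responses
    target    : sparse category to collapse (here, Category 3)
    direction : "up" merges into target + 1, "down" into target - 1
    """
    merged = target + 1 if direction == "up" else target - 1
    return [merged if r == target else r for r in responses]

data = [0, 1, 2, 3, 4, 3, 2]
print(collapse_category(data, 3, "up"))    # [0, 1, 2, 4, 4, 4, 2]
print(collapse_category(data, 3, "down"))  # [0, 1, 2, 2, 4, 2, 2]
```

Note that after collapsing up, the remaining score values are no longer consecutive (0, 1, 2, 4); estimation software such as mirt typically recodes observed categories to consecutive integers before fitting.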
After generating the analysis datasets, a Generalized Partial Credit or Graded Response IRT model was fitted to the data, depending on which data-generating model was used. Parameter estimates were obtained using the standard EM algorithm with fixed quadrature within the mirt package. To determine how closely recovered item parameters from collapsed datasets match parameters recovered from the baseline dataset, we calculated the relative bias (Hoogland & Boomsma, 1998) of estimated item parameters. Using the discrimination parameter as an example, we denote $a_i$ as the simulated true discrimination parameter and $\bar{\hat{a}}_i$ as the mean of the recovered discrimination parameters for the $i$th item over all simulations. Then the relative bias of the discrimination parameter for the $i$th item is calculated as:

$$RB(\hat{a}_i) = \frac{\bar{\hat{a}}_i - a_i}{a_i} \tag{4}$$
Relative bias values more extreme than $\pm 0.05$ indicate that the item parameter of interest was not well recovered.
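Equation 4 amounts to a one-line computation. In this hypothetical sketch, the true discrimination is 1.20 and the five estimates stand in for per-replication recoveries:

```python
def relative_bias(estimates, true_value):
    """Relative bias of a recovered parameter (Hoogland & Boomsma, 1998):
    mean estimate over replications, minus the true value, over the true value."""
    mean_est = sum(estimates) / len(estimates)
    return (mean_est - true_value) / true_value

# Hypothetical discrimination estimates from five replications, true a = 1.20.
rb = relative_bias([1.18, 1.25, 1.22, 1.19, 1.21], 1.20)
print(round(rb, 3))  # 0.008 — well within the |0.05| threshold
```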
When collapsing categories, the number of threshold parameters changes. Figure 1 displays how threshold parameters change when collapsing categories. The parameters on the top of the number line denote the simulated true data-generating parameters, and the parameters on the bottom of the number line denote the recovered item parameters. When the data do not contain any collapsed categories (Figure 1a), we can directly compare the recovered discrimination and threshold parameters, denoted $\hat{a}_i$ and $\hat{b}_{ik}$, with their data-generating counterparts. When collapsing response Category 3 up into Category 4 (Figure 1b), we are essentially removing the threshold parameter between Categories 3 and 4 and directly comparing the remaining item parameters. When collapsing response Category 3 down into Category 2 (Figure 1c), we are removing the threshold parameter between Categories 2 and 3.
To determine how closely the recovered person parameters matched the simulated true data-generating person parameters, we calculated the simulation average bias and Root Mean Squared Error (RMSE) of recovered person parameters. RMSE is calculated using Equation 5, where $\hat{\theta}_p$ is the recovered person parameter of Person $p$ and $\theta_p$ is the simulated true data-generating person parameter for Person $p$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{p=1}^{N} \left(\hat{\theta}_p - \theta_p\right)^2} \tag{5}$$
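Equation 5 can likewise be sketched directly (the three person parameters here are hypothetical):

```python
import math

def rmse(est, true):
    """Root mean squared error of recovered person parameters (Equation 5)."""
    return math.sqrt(sum((e - t) ** 2 for e, t in zip(est, true)) / len(true))

print(round(rmse([0.1, -0.5, 1.2], [0.0, -0.4, 1.0]), 3))  # 0.141
```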
To understand if category collapse results in detectable model misfit we calculated the percentage of items that were flagged for model misspecification across all simulations. In summary, our simulation study is a completely crossed design with (a) 6 sample size levels (150, 250, 500, 1000, 1500, 2000), (b) 2 levels of response sparseness for Category 3 (5%, 2.5%), (c) 3 collapse directions (baseline, collapsed-up, collapsed-down), and (d) 2 IRT models (GRM and GPCM).
Simulation Study Results
Category Collapse and Data-Model Fit
Figures 2 and 3 below display the percentage of simulations (out of 500) in which an item containing collapsed categories was flagged for data-model misfit using an alpha level of .05. When examining GRM data-model fit (Figure 2), we observed slightly inflated Type I error rates for Items 7–12 in the GRM baseline dataset. This suggests that the $S\text{-}X^2$ test is sensitive to sparseness in GRM data. Kang and Chen (2008) also observed inflated Type I error rates when using the $S\text{-}X^2$ statistic with sparse response frequencies. Kang and Chen concluded that, despite the inflated Type I error rate, the test can still be used for determining GRM data-model fit. When fitting a GRM to data containing collapsed categories, data-model fit improves for items containing collapsed categories. In almost all cases, the percentage of flagged simulations decreased after category collapse.
Figure 2
Percentage of Flagged Simulations: Items 7–12. GRM Data
Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories and using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories.
Figure 3
Percentage of Flagged Simulations: Items 7–12. GPCM Data
Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories and using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories.
Conversely, when using GPCM with collapsed categories (Figure 3) the percentage of simulations flagged for data-model misfit increased. This result confirms research performed in Harel (2014) by demonstrating that collapsing categories with GPCM data introduces undesirable data-model misfit. Additionally, unique to our study, we can see from Figure 3 that the power to detect this misfit is only present at larger sample sizes. At smaller sample sizes, collapsing GPCM response categories does not seem to worsen the data-model fit.
Category Collapse and Person Parameter Recovery
The person parameter was well recovered when fitting a Graded Response model under all sample size, collapse direction, and sparseness conditions. The average bias of recovered person parameters over 500 simulations was very close to zero in all situations. Additionally, collapsing response Category 3 resulted in a decrease in RMSE, suggesting that person parameters were better recovered under the collapsed conditions. These results are presented in Figures 4 and 5. This finding supports the research performed in Jiang (2018): when adjacent response categories were collapsed due to response sparseness, person parameter recovery was not affected. These results support that as much as 5% of the total number of responses can be collapsed without influencing GRM person parameter recovery, with a sample size as small as 150.
Figure 4
Average Bias of Recovered GRM Person Parameters
Figure 5
RMSE of Recovered GRM Person Parameters
Similar results were observed when examining person parameter recovery using the GPCM with collapsed categories (Figures 6 and 7). The average bias was very close to zero for all simulation conditions. Standard errors were consistently around 0.30 for all simulation conditions. The RMSE of recovered person parameters was smallest when no responses were collapsed, and both collapse conditions resulted in similar RMSE values. However, the increase in RMSE was marginal (< .10). These results suggest that despite the intentional model misspecification induced by using the GPCM with collapsed categories, person parameter recovery is not affected.
Figure 6
Average Bias of Recovered GPCM Person Parameters
Figure 7
RMSE of Recovered GPCM Person Parameters
Category Collapse and Item Parameter Recovery
When fitting a Graded Response model using a 5% collapse rule, Items 7–12 contain collapsed categories. The slope parameter was well recovered for all items containing collapsed categories, as were all of the threshold parameters. Using a 2.5% collapse rule, Items 10–12 contained collapsed categories. Similar to the 5% category collapse results, the slope and threshold parameters were well recovered from all items containing collapsed categories. In general, item parameters were best recovered when response Category 3 was collapsed up into response Category 4. These results are displayed in Figures 8 and 9, respectively.
Figure 8
Relative Bias of Recovered GRM Item Parameters 5% Collapse Condition
Note. When using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.
Figure 9
Relative Bias of Recovered GRM Item Parameters 2.5% Collapse Condition
Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.
Figures 10 and 11 display the relative bias of recovered item parameters when using a 5% and a 2.5% collapse rule, respectively, for response Category 3 with the GPCM. Using a 5% collapse rule, where Items 7–12 have collapsed categories, we can see from Figure 10 that item parameters are not well recovered from data with collapsed categories. The threshold parameter adjacent to the collapsed category displays significant bias regardless of collapse direction; however, collapsing up seems to introduce the least amount of bias into this estimate. In addition, the threshold parameters for the surrounding response categories display significant bias when using collapsed-down data. This suggests that using the GPCM with collapsed categories not only biases the parameters related to the target response category, but also influences other response categories.
Figure 10
Relative Bias of Recovered GPCM Item Parameters 5% Collapse Condition
Note. When using a 5% endorsement sparseness collapse rule Items 7–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.
Figure 11
Relative Bias of Recovered GPCM Item Parameters 2.5% Collapse Condition
Note. When using a 2.5% endorsement sparseness collapse rule Items 10–12 contained collapsed categories. The dashed horizontal lines indicate the |0.05| threshold for extreme relative bias.
Using a 2.5% collapse threshold (Figure 11) resulted in lower (but still significant) relative bias values. From Figure 11 we can also compare parameter estimates between items without collapsed categories (Items 7–9) and items with collapsed categories (Items 10–12). Using collapsed categories significantly increases the relative bias in item parameter estimates. All item parameters display significant relative bias values regardless of collapse direction. Similar to the 5% collapse condition, collapsing response Category 3 up induces lower (but still significant) relative bias values.
Empirical Example of Category Collapse
In this section, we analyze two-cohort repeated-measures data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI). A total of 1,656 ADNI participants were recruited from 57 sites in the United States and Canada. Participants were between the ages of 55 and 90 and responded to a series of initial tests that were repeated at intervals over subsequent years. We used cognitive battery data from two ADNI cohorts: ADNI1 and ADNI2/ADNI GO. Specifically, we focus on the memory section (ADNI-MEM) of the cognitive battery.
To address MCMC convergence errors, cognitive battery polytomous item response categories were combined if less than 20 individuals endorsed a particular category. Dichotomous response items were dropped if less than 20 individuals endorsed an option. ADNI1 collapsed categories were used for any shared items with ADNI2/ADNI GO. Table 2 below shows how the items measuring memory were recoded (Gibbons et al., 2012; Wang et al., 2023).
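The endorsement rule described above, combining any polytomous category endorsed by fewer than 20 individuals with an adjacent category, can be sketched as a single-pass recoder. This is only an illustration: it always merges downward, whereas the published ADNI-MEM recodings in Table 2 were set per item.

```python
from collections import Counter

def collapse_sparse_categories(scores, min_count=20):
    """Single-pass sketch: merge any category endorsed by fewer than
    `min_count` respondents into the adjacent lower category (the lowest
    category is left as-is). Counts are not re-tallied after each merge."""
    counts = Counter(scores)
    cats = sorted(counts)
    mapping = {}
    for i, c in enumerate(cats):
        if counts[c] < min_count and i > 0:
            # Follow the previous category's mapping so chains of sparse
            # categories all land on the same target.
            mapping[c] = mapping.get(cats[i - 1], cats[i - 1])
    return [mapping.get(s, s) for s in scores]

# Hypothetical item: Category 2 is endorsed by only 5 respondents.
scores = [0] * 30 + [1] * 25 + [2] * 5 + [3] * 40
out = collapse_sparse_categories(scores)
print(sorted(set(out)))  # [0, 1, 3]
```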
Table 2
Recoded Response Categories ADNI-MEM in Gibbons et al.
| ADNI-MEM Measure | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| RAVLT Trial 1 | 0–2a | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11–15a |
| RAVLT Trial 2 | 0–2a | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11–15a |
| RAVLT Trial 3 | 0–2a | 3 | 4 | 5–6a | 7–8a | 9 | 10 | 11 | 12 | 13–14a |
| RAVLT Trial 4 | 0–3a | 4 | 5–6a | 7–8a | 9 | 10 | 11 | 12 | 13 | 14–15a |
| RAVLT Trial 5 | 0–3a | 4 | 5 | 6–7a | 8–9a | 10–11a | 12 | 13 | 14 | 15 |
| Interference | 0–1a | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9–15a | |
| Immediate recall | 0 | 1–2a | 3–4a | 5–6a | 7 | 8 | 9 | 10–11a | 12–13a | 14–15a |
| 30-minute delay | 0 | 1–2a | 3–4a | 5–6a | 7 | 8 | 9 | 10–11a | 12–13a | 14–15a |
| Recognition | 0 | 1 | 2–3a | 4–5a | 6–7a | 8–9a | 10–11a | 12–13a | 14 | 15 |
| ADAS Cog – Trial 1 | 0–1a | 2 | 3 | 4 | 5 | 6 | 7 | 8–10a | ||
| ADAS Cog – Trial 2 | 0–2a | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| ADAS Cog – Trial 3 | 0–2a | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
| Recall | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9–10a |
| Recognition present | 0–3a | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
| Recognition absent | 0–4a | 5–6a | 7 | 8 | 9 | 10 | 11 | 12 | ||
| Logical Memory – Immediate | 0–1a | 2–3a | 4–5a | 6–7a | 8–9a | 10–12a | 13–14a | 15–16a | 17–18a | 19–25a |
| Logical Memory – Delay | 0 | 1–2a | 3–4a | 5–8a | 9–11a | 12 | 13 | 14–15a | 16–17a | 18–25a |
| MMSE Ball Recall | 2 | 1 | ||||||||
| MMSE Flag Recall | 2 | 1 | ||||||||
| MMSE Tree Recall | 2 | 1 | ||||||||
Note. Column headers indicate recoded score values; cell entries indicate the original scores mapped to each recoded score.
a indicates that adjacent response categories were combined.
A 2-Parameter Logistic (2PL) model was fit to items with binary responses, and a Graded Response Model (GRM) was fit to items with polytomous responses. The GRM was selected for polytomous responses because the data collected were Likert-type. This fitting was performed twice: once on the original dataset and once on the collapsed dataset. Person parameters were extracted from each model fitting. Bias, Root Mean Square Error (RMSE), and average Standard Error (SE) change were calculated, with the original dataset serving as the “true” person parameters.
From Table 3 we can see that collapsing response categories did not have a negative effect on person parameter recovery. The bias between person parameter estimates from the original dataset (without collapsed categories) to person parameter estimates using collapsed data was very close to zero for all four datasets. The RMSE was also very close to zero indicating that the person parameter was well recovered after collapsing response categories. The average standard error of person parameter estimates increased slightly after collapsing response categories, but the increase was not significant.
Table 3
ADNI-MEM Person Parameter Recovery Summary Statistics
| Data | Bias | RMSE | Average SE Change |
|---|---|---|---|
| ADNI1 Baseline | 0.001 | 0.057 | 0.007 |
| ADNI1 Follow up | 0.001 | 0.063 | 0.008 |
| ADNI2 Baseline | 0.004 | 0.054 | 0.005 |
| ADNI2 Follow up | 0.004 | 0.057 | 0.007 |
Note. Table values indicate the Bias, RMSE, and Average SE Change between the original dataset and the dataset containing collapsed categories.
Practical Implications for Researchers
This study examines the impact category collapse has on IRT parameter recovery and IRT data-model fit. Our study expands the literature by exploring parameter recovery and data-model fit for the Generalized Partial Credit IRT model along with the Graded Response IRT model.
In practice, since the true data generating model is unknown, the candidate IRT models (e.g., GRM or GPCM) are often selected based on item types. For instance, data collected from a Likert-type item would be appropriate for GRM, while scores from a constructed response item would be appropriate for GPCM. We provide the following example for researchers to consider when deciding which candidate IRT model to use.
Consider an assessment scored from 0 to 4 where students are asked to solve a constructed response math item. When scoring student work, students who earn a 3 have also successfully completed enough work to earn a score of 1 and 2. This assumption leads us to recommend the GPCM as a candidate IRT model. In contrast, consider a Likert-style survey item with responses “Disagree,” “Neutral,” and “Agree.” This would lead us to recommend the GRM as a candidate IRT model.
We caution researchers against applying both the GRM and GPCM to their data and selecting the one with the best-fitting results, because model parameters must be interpreted in a way that aligns with the specific item types. Instead, we strongly recommend that researchers use a statistical test such as the generalized $S\text{-}X^2$ test to confirm that their observed item response data fit their proposed IRT model prior to category collapse.
In sum, we provide the following recommendations to practitioners: Firstly, if the observed item response data comes from Likert scale items, research practitioners can fit a GRM to a collapsed dataset. There was no significant data-model misfit introduced for any items containing collapsed categories. Secondly, if GRM is used, practitioners may use either the 2.5% or 5% endorsement threshold when deciding to collapse adjacent response categories. Practitioners should collapse the targeted response category down into the next lowest adjacent category. This combination resulted in the least bias in recovered IRT person and item parameters.
Thirdly, if the observed item response data best fits a GPCM we caution against fitting a GPCM to a collapsed dataset. This process introduced significant data-model misfit for larger sample sizes along with substantial relative bias in recovered person and item parameters. Lastly, if researchers wish to collapse categories when using GPCM we recommend that researchers collapse the target response category up into the next highest response category. This process still produces significant relative bias values in recovered item and person parameters, but the bias is lower when compared to collapsing down.
Conclusion
The purpose of this study was to extend the literature on category collapse by investigating, through rigorous Monte Carlo simulations, whether category collapse can be justified when utilizing the Graded Response (GRM) and Generalized Partial Credit (GPCM) Item Response Theory (IRT) models. From our extensive simulation study, we concluded that when using the Graded Response model, adjacent response categories can be combined without biasing IRT parameter estimation. In contrast, when using a Generalized Partial Credit model with data containing collapsed categories, IRT parameters were not well recovered and significant data-model misfit was introduced into the analysis.
This is an open access article distributed under the terms of the Creative Commons Attribution License.