Non-cognitive constructs have long been of interest in psychological research and have traditionally been measured using single-stimulus formats (SS), such as the Likert format. However, this format has been known to suffer from response biases such as acquiescence and faking, which can undermine reliability, validity, and variability of the observed scores (Salgado, 2016), and alter the item covariance structure (McCrae et al., 2001). A solution that has shown promise in solving some of these problems is the forced-choice (FC) format. This format increases criterion-related validity (Salgado & Táuriz, 2014) and reduces both faking (Cao & Drasgow, 2019) and other response biases (Kreitchmann et al., 2019). However, recent studies have emphasized that these advantages are only achievable if forced-choice questionnaires (FCQs) are carefully designed (Graña et al., 2025).
Given the numerous possible item combinations involved in block construction, several procedures have been developed to optimize the assembly process. However, to date, a comprehensive comparison of these methods is lacking. The present study aims to address this gap by evaluating the most effective method for assembling FCQs from a SS item bank. The comparison is limited to the two options available in the software at the time of writing, namely the genetic algorithm (GA; Kreitchmann et al., 2022) and the simulated annealing algorithm (SA; Li et al., 2022). While a linear programming approach is possible, it can be substantially more computationally demanding and should be explored in detail in future research. Other heuristics, such as ant colony optimization, have not yet been applied. Therefore, a systematic comparison of available methods is needed to guide researchers in assembling FCQs, especially since GA is implemented in a Shiny app and SA is available in the autoFC R package.
Forced-Choice Questionnaire Design and Modeling
In recent years, there has been a growing trend toward the development and use of FCQs to assess non-cognitive constructs (Lee et al., 2025). The FC format can be distinguished from SS format in that a choice must be made among the alternatives rather than rating each statement. A commonly used example is a FC block consisting of item pairs, in which two SS items are presented together and the respondent is asked to choose one over the other (see Figure 1; for a comprehensive review, see Hontangas et al., 2015).
Figure 1
Examples of Non-Cognitive Questionnaire Formats
Despite the advantages of FCQs over SS, this response format can introduce fully or partially ipsative scores. Ipsativity refers to the interdependence among trait scores, meaning that if a person scores higher on one trait, they must score lower on another. This can affect validity. For example, in a purely ipsative FCQ, the validity coefficients of all measured traits with respect to a given external criterion will sum (and average) to zero (Hicks, 1970). Full rank of the effective FCQ loading structure is a necessary condition for achieving non-ipsative scores, as noted in Brown (2016). Within the Thurstonian item response theory (IRT) framework, this condition is typically expressed in terms of the factor loading matrix. For instance, the FCQ loading matrix becomes rank-deficient when the loadings within every block or within every dimension are equal. A straightforward way to avoid this issue is to combine items with factor loadings of opposite sign within each block. Under alternative IRT parameterizations, however, the same rank condition can be expressed in terms of item scale parameters rather than factor loadings. Morillo (2018, p. 74) further illustrates that under such parameterizations, the matrix can also become rank-deficient when the scale parameters of the two dimensions represented in a block maintain a constant ratio across all blocks measuring the same pair of dimensions. While these conditions are violated in classical test theory scoring using blocks of equally keyed items (with equal weights for all items), under IRT modeling, where scale parameters are allowed to vary, such violations occur only on rare occasions. Some available IRT models allow researchers to formally characterize the response processes underlying FC formats and obtain non-ipsative scores (e.g., Brown & Maydeu-Olivares, 2011; Morillo et al., 2016; Stark et al., 2005).
Ipsativity is therefore a property of the scoring method rather than of the FC format itself. Hence, ipsativity can be addressed with models such as the Multi-Unidimensional Pairwise Preference Two-Parameter Logistic (MUPP-2PL; Morillo et al., 2016) or the Thurstonian IRT for FC data (TIRT; Brown & Maydeu-Olivares, 2011). For binary FC comparisons, these models yield nearly equivalent response probabilities, although differences increase for block sizes greater than two. Test assembly assumes that item parameters are invariant when moving from SS to pairwise FC administrations. Empirical evidence supports approximate invariance: Lin and Brown (2017) found TIRT parameters largely stable across block compositions (quads vs. triplets), while Morillo et al. (2019) reported that MUPP-2PL parameters from FC blocks closely matched those from graded-scale formats. In this study, we adopt the MUPP-2PL framework, for which invariance between Likert-type and FC formats has been directly evaluated, showing correlations above .90 across formats.
Multi-Unidimensional Pairwise Preference Two-Parameter Logistic Model
The MUPP-2PL model can be used in dichotomous FC blocks, where the probability of agreement with each response option is modeled following a 2PL function. The model includes an invariance assumption, which states that item parameters remain constant regardless of the response format (i.e., FC vs. Likert) and the within-block context (i.e., which item is paired with which). Therefore, $a$ and $b$ in Equation (1) should be the same for the 2PL items and in the FC block. The block characteristic function is:

$$P(y_i = 1 \mid \boldsymbol{\theta}) = \frac{1}{1 + \exp\left[-\left(a_{i_1}(\theta_{i_1} - b_{i_1}) - a_{i_2}(\theta_{i_2} - b_{i_2})\right)\right]} \tag{1}$$

$$P(y_i = 1 \mid \boldsymbol{\theta}) = \frac{1}{1 + \exp\left[-\left(a_{i_1}\theta_{i_1} - a_{i_2}\theta_{i_2} + d_i\right)\right]}, \qquad d_i = a_{i_2} b_{i_2} - a_{i_1} b_{i_1} \tag{2}$$

where $y_i = 1$ indicates that the respondent selected item $i_1$ (the first item in the pair); $\boldsymbol{\theta}$ represents a person’s position on each of the latent traits measured by the FCQ; $\theta_{i_1}$ and $\theta_{i_2}$ are the coordinates of $\boldsymbol{\theta}$ for items $i_1$ and $i_2$; $a_{i_1}$ and $a_{i_2}$ are the scale parameters; $d_i$ is the block intercept parameter; and $b_{i_1}$ and $b_{i_2}$ are the location parameters. The sign of the scale (discrimination) parameter determines the item’s polarity. A positive scale parameter signifies a positively keyed item, where higher trait levels are associated with greater agreement. Conversely, a negative scale parameter indicates a negatively keyed item, where higher trait levels correspond to lower agreement. If both items have the same sign, either positive or negative, the block is considered homopolar or equally keyed; otherwise, it is called heteropolar or unequally keyed. In any case, it must be ensured that the resulting factor loading matrix is of full rank so that non-ipsative scores can be obtained.
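As a minimal sketch of the block characteristic function, the snippet below assumes the commonly cited MUPP-2PL form in which the log-odds of choosing the first item equal $a_1\theta_1 - a_2\theta_2 + d$, with $d = a_2 b_2 - a_1 b_1$. The function name and argument names are hypothetical and serve only for illustration.

```python
import math

def mupp2pl_prob(theta1, theta2, a1, a2, b1, b2):
    """Probability of choosing the first item of a pair.

    Assumed form (illustrative): logit = a1*theta1 - a2*theta2 + d,
    where d = a2*b2 - a1*b1 is the block intercept.
    theta1/theta2 are the trait coordinates addressed by items 1 and 2.
    """
    d = a2 * b2 - a1 * b1                 # block intercept from item locations
    logit = a1 * theta1 - a2 * theta2 + d
    return 1.0 / (1.0 + math.exp(-logit))

# Homopolar block: both scale parameters positive.
p_homo = mupp2pl_prob(theta1=1.0, theta2=0.0, a1=1.5, a2=1.5, b1=0.0, b2=0.0)

# Heteropolar block: the second item is negatively keyed (a2 < 0), so a high
# standing on its trait pushes the respondent toward the first item.
p_hetero = mupp2pl_prob(theta1=0.0, theta2=1.0, a1=1.0, a2=-1.0, b1=0.0, b2=0.0)
```

In the homopolar case the two traits pull in opposite directions, whereas in the heteropolar case they pull in the same direction, which is the source of the differential weighting that the rank condition relies on.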
Forced-Choice Questionnaire Optimal Assembly
One common approach for designing a FCQ is to pair items from a SS item bank to form blocks. This process can yield thousands of potential FCQs with varying levels of reliability. Because reliability is essential for validity, assembling blocks without considering item properties may result in suboptimal tests. With the growing use of FCQs, the need for optimal assembly procedures has become increasingly important.
Social Desirability
One important consideration in the design of FCQs is the control of social desirability (SD). If SD matching (SDM) is not carefully considered during block construction, differences in item SDs are likely to emerge. This is especially true for heteropolar pairs. Such differences may compromise the validity of the questionnaire scores by making it easier for respondents to select socially desirable options instead of those that reflect their true traits. Matching items by their level of SD is a widely recommended strategy for reducing faking in FCQs (Cao & Drasgow, 2019; Pavlov et al., 2021).
One approach for matching items based on SD involves convening an expert committee to rate the desirability level of each item. The average desirability rating is then calculated for each item across all raters and items are matched by minimizing the absolute difference between items’ mean desirability values. Items are considered well-matched when the difference in SD falls below a predetermined cutoff. For homopolar blocks, a common cutoff is 0.50 on a 5-point scale, whereas for heteropolar blocks, the cutoff typically needs to be relaxed in order to ensure that a sufficient number of valid heteropolar pairs can be formed (Graña et al., 2025; Li et al., 2025).
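The matching rule can be sketched as follows. This is an illustrative example rather than the procedure of any particular package: the function name is hypothetical, a pair is treated as eligible when the absolute difference in mean ratings does not exceed the cutoff, and the 0.75 heteropolar default is only a placeholder for a relaxed cutoff.

```python
import numpy as np

def sd_eligibility(ratings, keys, cutoff_homo=0.50, cutoff_hetero=0.75):
    """Return mean SD per item and a boolean matrix of SD-eligible pairs.

    ratings: array of shape (n_raters, n_items) with desirability ratings.
    keys:    array of +1 / -1 giving each item's keying direction.
    The cutoff values are illustrative defaults, not fixed recommendations.
    """
    ratings = np.asarray(ratings, dtype=float)
    keys = np.asarray(keys)
    mean_sd = ratings.mean(axis=0)                       # average across raters
    diff = np.abs(mean_sd[:, None] - mean_sd[None, :])   # pairwise |SD_i - SD_j|
    same_key = keys[:, None] == keys[None, :]            # homopolar pairs
    cutoff = np.where(same_key, cutoff_homo, cutoff_hetero)
    eligible = diff <= cutoff
    np.fill_diagonal(eligible, False)                    # no item pairs with itself
    return mean_sd, eligible

# Two raters, three items; the third item is negatively keyed.
mean_sd, eligible = sd_eligibility([[4, 4, 2], [5, 4, 3]], keys=[1, 1, -1])
```

Here items 1 and 2 differ by 0.5 in mean desirability and can be paired, while item 3 is too far from both to form a heteropolar block under the relaxed cutoff.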
Scale Parameters
The ability to obtain normative scoring from FCQ responses through IRT relies on the differential weighting of responses (Bürkner, 2022), which is driven by the scale parameters. Heteropolar blocks naturally lead to this differential weighting. However, this is a topic of debate in FCQ design since pairing them can be tricky and increase cognitive demand. On the one hand, studies (e.g., Brown & Maydeu-Olivares, 2011; Frick et al., 2023) suggest that including heteropolar blocks can enhance estimation accuracy and validity. On the other hand, recent empirical findings (Graña et al., 2025) question whether heteropolar blocks are actually necessary. These authors suggest that if FCQs are assembled with homopolar blocks that differ in scale parameters, heteropolar blocks are not necessary. This is an ongoing debate, with many authors suggesting that including approximately 15–40% heteropolar blocks can enhance the psychometric properties of FCQs, even if it means accepting a minor compromise in SD (Lee et al., 2022; Li et al., 2025). Thus, achieving faking resistance requires careful SDM, especially when heteropolar blocks are included.
Algorithms for Optimal Assembly
Simulated Annealing Algorithm
The simulated annealing algorithm is a heuristic optimization method inspired by the physical annealing of solids. It operates in two steps (Kirkpatrick et al., 1983): first, the system’s temperature is raised to a maximum, then gradually decreased until a minimum is reached, minimizing the system’s energy, which corresponds to the cost of a solution. A key feature of SA is its ability to accept worse solutions to escape local optima. Li et al. (2022) implemented SA to assemble FCQs in the autoFC R package. The procedure begins with a user-defined blueprint specifying the number of blocks, block size, trait composition, and keying constraints. Each item is characterized by numerical attributes, such as SD or factor loadings, and the algorithm computes a weighted composite energy for each block, combining block-level indices for each attribute using user-specified weights so that higher absolute values reflect more desirable configurations. Heteropolar blocks are not included in the energy calculation; instead, the user defines specific trait combinations, and blocks meeting these conditions are randomly selected. Starting from a random admissible assembly, the algorithm iteratively swaps or replaces items, accepting changes that lower energy while occasionally allowing higher-energy solutions early on to avoid local optima. As the temperature decreases, the search converges on the lowest-energy arrangement, yielding FC blocks that satisfy the psychometric constraints. For a detailed tutorial, see Li et al. (2025). Version 0.2.0.1002 of the autoFC package was used in this study.
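The accept-or-reject logic described above can be illustrated with a generic simulated annealing sketch. It is not the autoFC implementation: the energy here is simply the total within-pair SD mismatch, the cooling schedule is geometric, and all names and default values are assumptions chosen for illustration.

```python
import math
import random

def energy(pairs, sd):
    """Total within-pair SD mismatch (lower is better in this sketch)."""
    return sum(abs(sd[i] - sd[j]) for i, j in pairs)

def anneal_pairs(sd, n_iter=5000, t_start=1.0, t_end=0.001, seed=0):
    rng = random.Random(seed)
    items = list(range(len(sd)))
    rng.shuffle(items)
    # Random starting assembly; with an odd item count the last item is unused.
    pairs = [[items[k], items[k + 1]] for k in range(0, len(items) - 1, 2)]
    current = energy(pairs, sd)
    best, best_pairs = current, [tuple(blk) for blk in pairs]
    for step in range(n_iter):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / (n_iter - 1))
        p, q = rng.sample(range(len(pairs)), 2)      # two distinct blocks
        s, t = rng.randrange(2), rng.randrange(2)    # one position in each
        pairs[p][s], pairs[q][t] = pairs[q][t], pairs[p][s]
        candidate = energy(pairs, sd)
        delta = candidate - current
        # Always accept improvements; accept worse moves with prob exp(-delta/T).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current = candidate
            if current < best:
                best, best_pairs = current, [tuple(blk) for blk in pairs]
        else:
            pairs[p][s], pairs[q][t] = pairs[q][t], pairs[p][s]  # undo the swap
    return best_pairs, best

sd_values = [1.0, 1.1, 3.0, 3.1, 5.0, 5.2]
blocks, best_energy = anneal_pairs(sd_values)
```

Early in the run the high temperature lets the search accept worse assemblies; as the temperature falls, only improving swaps survive and the solution settles on pairs with closely matched SD.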
Genetic Algorithm
Genetic algorithms are heuristic optimization methods inspired by principles of population genetics. Kreitchmann et al. (2022) adapted the GA for FCQ assembly to maximize the marginal reliability of selected blocks. In this approach, new candidate blocks are generated using a node histogram-based sampling algorithm, which constructs probabilistic models from the genotypes of previous generations. Each new genotype is formed in two steps: first, a portion of the parent genotype is directly passed to the offspring as a template; second, the remaining elements are sampled from a conditional probability distribution capturing dependencies observed in prior generations, with a mutation factor added as noise. Candidates are then evaluated against their parents based on constraint compliance and the objective function, which is to maximize block reliability, and the best candidates advance to the next generation. Block content constraints are represented in a matrix (C), where each cell is 1 if a pair of items can be combined into a block based on content criteria and 0 if not. The node histogram records the frequency of item selection within a generation. This iterative process continues until convergence, producing FCQ blocks that satisfy psychometric constraints and maximize reliability. More details on the algorithm can be found in Kreitchmann et al. (2022).
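To make the constraint matrix concrete, the sketch below builds a binary matrix under one assumed content criterion, namely that the two items in a block must measure different traits. The function name is hypothetical, and other content rules could be encoded in the same way.

```python
import numpy as np

def content_constraint_matrix(traits):
    """C[i, j] = 1 if items i and j may share a block, 0 otherwise.

    Assumed criterion (illustrative): items must measure different traits,
    which also rules out pairing an item with itself.
    """
    traits = np.asarray(traits)
    return (traits[:, None] != traits[None, :]).astype(int)

C = content_constraint_matrix(["A", "A", "B"])
```

Additional restrictions, such as an SD cutoff, can be imposed by multiplying this matrix elementwise with another binary eligibility matrix.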
The Present Study
To date, there has been no comprehensive evaluation of existing methods for the optimal assembly of FCQs. The purpose of this study is to compare four such methods: the GA; two SA approaches, one using only the blueprint and the other also optimizing the within-block parameter differences; and a brute-force (BF) search. The comparison focuses on the reliability of the assembled questionnaires’ scores and on computational cost. The two SA variants differ in their optimization criterion: the former employs the basic functionality of the package, incorporating only content and SDM constraints without considering item parameters, whereas the latter adds a criterion that seeks to maximize differences in the scale parameters within each item pair. This study makes a novel contribution by incorporating SD constraints into GA and exploring how the relationship between SD and item parameters may affect the block assembly process. We conducted a simulation study evaluating the four methods and an empirical illustration. We hypothesize that: (1) GA will perform best in terms of reliability, as it uses the reliability of the assembled blocks as its objective function, followed by SA with scale parameter ($a$) optimization, SA blueprint, and BF; (2) there should be no significant differences between including and excluding heteropolar blocks; (3) a higher correlation of $a$ with SD is expected to reduce performance; and (4) SDM will negatively affect trait recovery. These hypotheses were not preregistered.
Method
Simulation Design and Data Generation
Three factors were systematically manipulated in the simulation study. For clarity, we categorize these factors into two groups: block factors, which relate to the FCQ construction, and an item factor, which pertains to the construction of the item banks. The block factors are: 1) questionnaire length (40, 80), and 2) use of SDM (Yes, No). The item factor is: 3) degree of correlation between the scale parameter, $a$, and SD of the positively keyed bank ($r_{a,SD}$), which relates to the types of blocks formed (homopolar vs. heteropolar), and results in four levels ($r_{a,SD} = .20$ with 25% heteropolar blocks, $r_{a,SD} = .20$ with 0% heteropolar blocks, $r_{a,SD} = .50$ with 0% heteropolar blocks, $r_{a,SD} = .80$ with 0% heteropolar blocks). Hereafter, “+” indicates items that are positively keyed to the trait, and “-” indicates items that are negatively keyed. All factors are fully crossed, resulting in 16 conditions.
First, the questionnaire length factor determines how many pairwise blocks are formed. We established two levels, 80 and 40 blocks. Second, the SDM factor indicates whether item pairs were matched based on their SD ratings, with two levels, yes (matching applied) and no (matching not applied). When SDM is applied, the absolute difference between two items’ ratings was calculated; if this difference exceeded a predefined cutoff, the items were not eligible for pairing. We established a cutoff of 0.5 for homopolar blocks to ensure a stricter level of matching. However, since it is more difficult to find heteropolar blocks that match in SD (Graña et al., 2025), the cutoff was relaxed to 0.75 for heteropolar blocks.
Since the third factor pertains to the generation of the item banks, we describe it together with the item bank generation process. One five-dimensional bank of 320 SS items was generated for each condition and replication, to imitate personality item pools such as the International Personality Item Pool (IPIP; Goldberg, 1999). Three prototypical item banks were created. The main difference among them lies in the degree of association between the scale parameter, $a$, and SD of the positively keyed items. All banks were balanced with 64 items per trait; in the mixed keyed bank, each trait had 32 positively and 32 negatively keyed items. Each simulation condition used a separate item bank, corresponding to one of the three bank types described below.
The item banks differed in the degree of correlation between $a$ and SD, as well as in block polarity, resulting in four distinct categories (Table 1). Bank 1 was used for Categories 1 and 2 (and Levels 1 and 2 of the third simulation factor). It included both positively keyed items and negatively keyed items, with a small correlation between $a$ and SD for both types ($r_{a,SD} = .20$) and a naturally high correlation across keys (). Using this bank, Category 1 formed a FCQ with 25% heteropolar blocks, while Category 2 included only homopolar blocks (0% heteropolar).
Table 1
Parameter Simulation Specifications
| Parameter | Bank 1 (+) | Bank 1 (-) | Bank 2 (+) | Bank 3 (+) |
|---|---|---|---|---|
| $a$ | N(1.5,0.5) | N(-1.5,0.5) | N(1.5,0.5) truncated at 0 | N(1.5,0.5) truncated at 0 |
| $d$ | N(-0.5,0.8) truncated at -3 | N(-1,0.8) truncated at -3 | N(-0.5,0.8) | N(-0.5,0.8) |
| SD | N(4,0.5) truncated at 1 and 5 | N(2,0.5) truncated at 1 and 5 | N(4,0.5) truncated at 1 and 5 | N(4,0.5) truncated at 1 and 5 |
| $r_{a,SD}$ | .20 | .20 | .50 | .80 |
| $r_{a,d}$ | .30 | -.30 | .30 | .30 |
| $r_{d,SD}$ | .00 | .00 | .00 | .00 |
Note. $a$ = scale parameter; $d$ = block intercept parameter; SD = social desirability of each item; $r_{a,SD}$ = correlation between $a$ and SD; $r_{a,d}$ = correlation between $a$ and $d$; $r_{d,SD}$ = correlation between $d$ and SD.
Bank 1, which reflects a more realistic scenario, is expected to pose less difficulty in assembling reliable tests due to the low correlation between $a$ and SD. To explore more challenging conditions, we included two additional item banks. Levels 3 and 4 of the third simulation factor (and Categories 3 and 4) correspond to Banks 2 and 3, respectively; both consist of positively keyed items and therefore can form only homopolar blocks. Bank 2 was generated with a moderate correlation between the scale parameter, $a$, and SD ($r_{a,SD} = .50$), while Bank 3 has a high correlation ($r_{a,SD} = .80$), making optimal block matching more difficult. The values were sampled this way to represent that, in real contexts, when traits are scored in the socially desirable direction (e.g., conscientiousness, emotional stability), positively keyed items tend to have higher desirability. The values in these distributions were primarily based on the empirical distributions of $a$, $d$, and SD reported in the publicly available datasets from Johnson (2014) and Hughes et al. (2021), while also considering that positively keyed items tend to exhibit higher SD values than negatively keyed items (Graña et al., 2025; Li et al., 2025). Under our scoring convention, positively keyed items for each trait were defined as the socially desirable direction (e.g., for items measuring Neuroticism, a positively keyed item has a lower SD rating). These choices were made to represent realistic item parameters and SD behavior. This also includes the correlation between $a$ and $d$ (approximately 0.30 for both positively and negatively keyed items). The correlation between $d$ and SD was kept at zero so that SD would only be linked to $a$, allowing us to analyze the impact of this variable.
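A positively keyed bank with a target correlation between the scale parameter and SD could be generated along the following lines. This is a hedged sketch, not the study's code: it assumes the second argument of each N(·,·) in Table 1 is a standard deviation, draws the scale parameter, intercept, and SD jointly from a trivariate normal, and applies the truncation by rejection sampling. The function name is hypothetical.

```python
import numpy as np

def simulate_positive_bank(n_items, rho_a_sd, seed=0):
    """Draw (a, d, SD) for a positively keyed bank (illustrative assumptions).

    Means and spreads follow the positively keyed columns of Table 1, with the
    spread treated as a standard deviation. Truncation (a > 0, 1 <= SD <= 5)
    is applied by rejection, which slightly attenuates the realized correlation.
    """
    rng = np.random.default_rng(seed)
    means = np.array([1.5, -0.5, 4.0])          # a, d, SD
    sds = np.array([0.5, 0.8, 0.5])
    corr = np.array([[1.0, 0.3, rho_a_sd],      # a-d = .30, a-SD = target
                     [0.3, 1.0, 0.0],           # d-SD = .00
                     [rho_a_sd, 0.0, 1.0]])
    cov = corr * np.outer(sds, sds)
    kept = np.empty((0, 3))
    while kept.shape[0] < n_items:
        draw = rng.multivariate_normal(means, cov, size=2 * n_items)
        ok = (draw[:, 0] > 0) & (draw[:, 2] >= 1) & (draw[:, 2] <= 5)
        kept = np.vstack([kept, draw[ok]])
    a, d, sd = kept[:n_items].T
    return a, d, sd

a, d, sd = simulate_positive_bank(5000, rho_a_sd=0.80)
realized_r = np.corrcoef(a, sd)[0, 1]
```

With a high target correlation, items with large scale parameters also tend to be the most desirable ones, so pairs that differ strongly in the scale parameter rarely satisfy a tight SD cutoff. This is the tension the third simulation factor is meant to probe.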
The structure of the simulation is as follows. First, a SS item bank was generated for each condition and replication. Second, a FCQ was constructed with each algorithm using the SS parameters. Then, a binary FC response dataset of 5,000 respondents was simulated using the MUPP-2PL for each assembled FCQ. Finally, the MUPP-2PL was estimated, and trait recovery was assessed for each questionnaire. Trait estimates ($\hat{\boldsymbol{\theta}}$) were obtained as maximum a posteriori scores with the Metropolis-Hastings Robbins-Monro algorithm. These analyses were conducted using the mirt R package (Chalmers, 2012). This process was replicated 50 times. We controlled for the following content constraints: (1) block multidimensionality (i.e., each block had to include items measuring two different traits), and (2) trait balance across the selected blocks (i.e., each trait had to be represented by the same number of items). Item repetition was not allowed. During FCQ assembly, we verified that content and polarity constraints were met and that the distribution of blocks and items across traits remained generally balanced, allowing a fair comparison of the algorithms. Additionally, we conducted simulation checks to evaluate whether any of the resulting scale parameter matrices after FCQ assembly were rank-deficient by analyzing the least singular value of such matrices. Across all conditions, assembly algorithms, and replications, the least singular value was strictly greater than zero, indicating that none of the matrices were rank-deficient. We recorded the time spent in seconds for each assembly algorithm. In the case of BF, 100 questionnaires that met the specified constraints and SD requirements were randomly formed and the one with the highest reliability was selected. The R code used for the analysis and the empirical study can be found at Sorrel et al. (2026). The repository also includes a document detailing all algorithm specifications.
All procedures were executed with a 2.50 GHz Intel Core i9-11900 CPU and 32 GB of RAM.
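The rank check described above can be sketched as follows. The construction of the block-by-trait matrix is an assumption made for illustration (the first item's scale parameter enters with a positive sign and the second with a negative sign), and the function names are hypothetical.

```python
import numpy as np

def block_scale_matrix(pairs, traits, a, n_traits):
    """Blocks x traits matrix of signed scale parameters (assumed layout).

    Row for block (i, j): +a[i] in the column of item i's trait and
    -a[j] in the column of item j's trait.
    """
    mat = np.zeros((len(pairs), n_traits))
    for row, (i, j) in enumerate(pairs):
        mat[row, traits[i]] += a[i]
        mat[row, traits[j]] -= a[j]
    return mat

def least_singular_value(mat):
    """Smallest singular value; (near) zero signals rank deficiency."""
    return np.linalg.svd(mat, compute_uv=False).min()

item_traits = [0, 1, 0, 2, 1, 2]            # trait index of each item
blocks = [(0, 1), (2, 3), (4, 5)]           # three multidimensional pairs

# Equal scale parameters everywhere: the matrix loses rank.
lsv_equal = least_singular_value(
    block_scale_matrix(blocks, item_traits, [1.0] * 6, n_traits=3))

# Varying scale parameters: the matrix is of full column rank.
lsv_varied = least_singular_value(
    block_scale_matrix(blocks, item_traits, [1.0, 2.0, 1.5, 1.0, 1.0, 0.5], n_traits=3))
```

The first case mirrors the equally weighted homopolar design that yields ipsative scores, while the second shows how unequal scale parameters alone can restore full rank.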
Measures of Trait Recovery
To compare the assembly methods, the main dependent variable was trait score recovery. Specifically, we computed for each replication and trait: (1) the true reliability, calculated using the squared correlation between estimated and true $\theta$ ($r^2_{\hat{\theta}\theta}$); (2) the root mean square error of $\hat{\theta}$, both overall (RMSE; Equation 3) and conditional on the true $\theta$ (conditional RMSE). The conditional RMSE was calculated within intervals of the true $\theta$ to examine how estimation accuracy varies across the latent continuum. Individuals were grouped into bins of 0.5 spanning from -2 to 2, and within each bin the RMSE was computed from the squared estimation errors of all individuals in that group. Additionally, an indicator of ipsativity was included, consisting of (3) the average trait correlation bias ($\text{Bias}_{\mathbf{R}}$; Equation 4).

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{n=1}^{N}\left(\hat{\theta}_n - \theta_n\right)^2} \tag{3}$$

$$\text{Bias}_{\mathbf{R}} = \tanh\left(\frac{1}{D(D-1)}\sum_{j \neq k}\left[\operatorname{artanh}(\hat{r}_{jk}) - \operatorname{artanh}(r_{jk})\right]\right) \tag{4}$$

where $N$ is the number of respondents, and $\hat{r}_{jk}$ and $r_{jk}$ are the elements of $\hat{\mathbf{R}}$ and $\mathbf{R}$, the estimated and true trait correlation matrices, respectively. We used the real-world correlations from the NEO-PI-R (Costa & McCrae, 1992). In the case of fully ipsative scores, a negative bias would be expected, since the correlations among ipsative scores with equal variances average $-1/(D-1)$, with $D$ representing the number of traits (Hicks, 1970). Dependent Variables (1) and (2) were calculated separately for each trait and then averaged across all five traits, whereas (3) was calculated by extracting the non-diagonal elements of $\hat{\mathbf{R}}$ and $\mathbf{R}$, applying Fisher’s Z-transformation to each correlation, computing the Z-differences, averaging these differences, and then back-transforming the average to the correlation metric (Corey et al., 1998). Results of the overall RMSE and of the four-way univariate analyses of variance (ANOVA) for each dependent variable, where algorithm was treated as a within-condition factor and the simulation conditions as between-condition factors, can be found in Tables S2 and S3 in the Supplementary Material (see Escudero et al., 2026). Partial eta-squared ($\eta_p^2$) values higher than .14 were considered as relevant effects (Cohen, 1988). Results for the overall RMSE (shown in Figure S1, Escudero et al., 2026) are omitted from the main text, as the conclusions are the same as for $r^2_{\hat{\theta}\theta}$. Therefore, we focus on the latter, which is a more commonly used metric. ANOVA results were used to guide the interpretation of the findings.
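The recovery measures can be sketched in a few lines. The code below is an illustration under stated assumptions rather than the study's implementation: function names are hypothetical, bins are taken as half-open intervals of width 0.5 from -2 to 2, and the correlation bias follows the Fisher Z procedure described in the text (transform, difference, average, back-transform).

```python
import numpy as np

def rmse(theta_hat, theta):
    """Overall root mean square error of the trait estimates."""
    return float(np.sqrt(np.mean((np.asarray(theta_hat) - np.asarray(theta)) ** 2)))

def conditional_rmse(theta_hat, theta, edges=None):
    """RMSE within bins of the true theta (NaN for empty bins)."""
    theta_hat, theta = np.asarray(theta_hat), np.asarray(theta)
    if edges is None:
        edges = np.arange(-2.0, 2.5, 0.5)            # -2, -1.5, ..., 2
    out = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (theta >= lo) & (theta < hi)          # half-open bin [lo, hi)
        out.append(rmse(theta_hat[mask], theta[mask]) if mask.any() else float("nan"))
    return out

def mean_correlation_bias(R_hat, R):
    """Average off-diagonal correlation bias via Fisher's Z."""
    R_hat, R = np.asarray(R_hat), np.asarray(R)
    idx = np.triu_indices_from(R, k=1)               # symmetric, so upper triangle suffices
    z_diff = np.arctanh(R_hat[idx]) - np.arctanh(R[idx])
    return float(np.tanh(z_diff.mean()))

# Fully ipsative pattern for D = 5 uncorrelated traits: all correlations -1/(D-1).
R_true = np.eye(5)
R_ipsative = np.full((5, 5), -0.25)
np.fill_diagonal(R_ipsative, 1.0)
bias = mean_correlation_bias(R_ipsative, R_true)

theta_true = np.array([-1.0, 0.0, 1.0])
theta_est = theta_true + 0.5
by_bin = conditional_rmse(theta_est, theta_true)
```

In the toy example, the bias equals -.25, the value implied by $-1/(D-1)$ for five traits when the true correlations are zero.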
Results
Algorithm Efficiency
The GA is notably influenced by the questionnaire length, with longer questionnaires resulting in increased duration, as shown in Table S1 in the Supplementary Material (see Escudero et al., 2026). Assembling questionnaires of 80 and 40 blocks took an average of 4.90 and 3.21 minutes, respectively. In contrast, the other algorithms show minimal sensitivity to questionnaire length. Both SA methods are typically completed in a few seconds. However, SA has higher skewness and kurtosis, as we implemented an iterative process that reruns the algorithm until the target design is achieved. Across all conditions and replications, the constraint on the number of heteropolar blocks was always satisfied. In a small number of replications of the SA method, however, a few blocks did not meet the SDM constraint. Out of the 400 replications with SDM, there were 28 with 1 block affected and 2 with 2 blocks affected for the SA blueprint, and SA with Scale Parameter optimization showed 54 with 1 block affected and 1 with 2 blocks affected. The small number of affected blocks (1 or 2 out of 40 or 80) suggests that the impact is negligible. In any case, this implies a potential advantage in reliability and a disadvantage in ipsativity for these replications.
Recovery of Trait Parameters
Table 2 presents the marginal results for the reliability ($r^2_{\hat{\theta}\theta}$) and the trait correlation bias of the four assembly methods. GA consistently yields the highest reliability (grand mean of .80 in Table 2), followed by SA with $a$ optimization (.78), BF (.76), and SA blueprint (.73). These differences indicate a large effect size (). All simulation conditions are relevant, as seen in Figure 2. The most relevant factor was the questionnaire length, which can be expected, as longer questionnaires enhance reliability (). Shorter tests show lower reliability overall, with the same relative patterns across correlations, heteropolar proportions, and algorithms. For instance, controlling SD at $r_{a,SD} = .80$ with 0% heteropolar blocks, GA’s reliability is .71 for Length 40, compared to .80 for Length 80. The next relevant factor was the degree of relation between the scale parameter, $a$, and SD in the item bank, together with the presence or absence of heteropolar blocks (). The highest reliability is achieved with mixed banks forming questionnaires containing 25% heteropolar blocks; for example, for Length 80, controlling for SD at $r_{a,SD} = .20$, GA achieves .88 vs. .84 in the same condition with 0% heteropolar blocks. As the correlation between the scale parameter and SD increases, reliability decreases. The same tendency can be seen in both test lengths. Specifically, the condition with the lowest reliability is the positively keyed item bank with $r_{a,SD} = .80$ in the 40-block test length, which is also where the largest differences between algorithms are seen: GA = .71, SA with $a$ optimization = .68, BF = .65, and SA blueprint = .60. Although the use of SDM had the smallest effect size among the factors examined (), it still had a meaningful impact on reliability, consistently leading to lower values due to the constraints it imposes on item pairing.
Specifically, taking GA as an example, reliability was consistently lower when incorporating the desirability constraint, decreasing on average by .01 points compared to the same condition without considering SD, and reaching a decrease of up to .04 points in cases where there is a strong relationship between the scale parameter and SD. The same pattern was observed for the other procedures. Notable interactions were: algorithm with length (), algorithm with D-H (), and SDM with D-H (). These interactions further indicate that these factors are relevant in assembling FCQs: the choice of algorithm matters more when the correlation between $a$ and SD is higher and when no heteropolar blocks are formed, and this same factor interacts with SDM. In all cases, the ordering of the methods described above is preserved.
Table 2
Average Trait Recovery Across 50 Replications for Questionnaires Assembled Using Each Algorithm in the Simulation Study
| Length | D-H | SDM | $r^2_{\hat{\theta}\theta}$: GA | $r^2_{\hat{\theta}\theta}$: SA$_a$ | $r^2_{\hat{\theta}\theta}$: BF | $r^2_{\hat{\theta}\theta}$: SAbp | Bias: GA | Bias: SA$_a$ | Bias: BF | Bias: SAbp |
|---|---|---|---|---|---|---|---|---|---|---|
| 80 | $r_{a,SD} = .20$ with 25% het | Yes | .88 | .87 | .87 | .86 | .01 | .01 | .01 | .00 |
| | | No | .88 | .87 | .87 | .86 | .01 | .01 | .01 | .01 |
| | $r_{a,SD} = .20$ with 0% het | Yes | .84 | .82 | .81 | .78 | -.05 | -.07 | -.10 | -.14 |
| | | No | .84 | .83 | .81 | .78 | -.05 | -.06 | -.10 | -.13 |
| | $r_{a,SD} = .50$ with 0% het | Yes | .83 | .82 | .79 | .76 | -.06 | -.07 | -.12 | -.17 |
| | | No | .84 | .83 | .81 | .78 | -.05 | -.06 | -.09 | -.14 |
| | $r_{a,SD} = .80$ with 0% het | Yes | .80 | .80 | .74 | .71 | -.10 | -.11 | -.21 | -.26 |
| | | No | .83 | .83 | .80 | .78 | -.05 | -.06 | -.10 | -.14 |
| 40 | $r_{a,SD} = .20$ with 25% het | Yes | .80 | .78 | .78 | .76 | .01 | .01 | .00 | .00 |
| | | No | .81 | .78 | .79 | .77 | .02 | .01 | .01 | .00 |
| | $r_{a,SD} = .20$ with 0% het | Yes | .76 | .71 | .71 | .66 | -.06 | -.12 | -.14 | -.20 |
| | | No | .76 | .72 | .71 | .66 | -.05 | -.11 | -.13 | -.21 |
| | $r_{a,SD} = .50$ with 0% het | Yes | .75 | .71 | .69 | .64 | -.07 | -.12 | -.17 | -.24 |
| | | No | .76 | .72 | .71 | .66 | -.05 | -.12 | -.13 | -.20 |
| | $r_{a,SD} = .80$ with 0% het | Yes | .71 | .68 | .65 | .60 | -.12 | -.17 | -.25 | -.32 |
| | | No | .75 | .71 | .70 | .66 | -.06 | -.12 | -.14 | -.21 |
| Grand mean | | | .80 | .78 | .76 | .73 | -.04 | -.07 | -.10 | -.15 |
Note. Maximum values of $r^2_{\hat{\theta}\theta}$ and bias values closest to zero are marked in bold. All standard deviations of $r^2_{\hat{\theta}\theta}$ and bias range between 0 and .03. $r^2_{\hat{\theta}\theta}$ = squared correlation between estimated and true $\theta$; Bias = average trait correlation bias; D-H = degree of relation between $a$ and SD of the positively keyed item bank and possibility of forming heteropolar blocks; het = heteropolar blocks; SDM = social desirability matching; GA = genetic algorithm; SA$_a$ = simulated annealing with scale parameter, $a$, optimization; BF = brute-force; SAbp = simulated annealing blueprint.
Figure 2
Squared Correlation Between $\hat{\theta}$ and $\theta$
Note. GA = genetic algorithm; SA$_a$ = simulated annealing with scale parameter, $a$, optimization; BF = brute-force; SAbp = simulated annealing blueprint; SDM = social desirability matching. Within every facet, the four algorithms are always displayed in the same left-to-right order.
As for the conditional RMSE, Figure 3 shows relevant differences across algorithms that align with the results found for reliability. Generally, GA shows the smallest error, as it is closest to zero, and SA blueprint is the method furthest from zero. RMSE is lower for $\theta$ values near zero and increases at the extremes, longer questionnaires reduce error, and the greatest differences among the algorithms are observed in the condition where $r_{a,SD} = .80$ in the 40-block questionnaire.
Figure 3
Average Conditional Root Mean Square Error
Note. GA = genetic algorithm; SA$_a$ = simulated annealing with scale parameter, $a$, optimization; BF = brute-force; SAbp = simulated annealing blueprint; SDM = social desirability matching.
Ipsativity Indicator
Regarding the ipsativity indicator (the average trait correlation bias), several patterns emerge from the absolute values in Table 2. As seen in Table S3 in the Supplementary Material (see Escudero et al., 2026), the ANOVA results follow the same pattern as the results previously described. First, mixed banks with 25% heteropolar blocks produce negligible bias across all algorithms (bias ranges from .00 to .02). In contrast, bias increases as the correlation between the scale parameter, $a$, and SD grows and when only positively keyed items are used. For the most demanding condition (i.e., $r_{a,SD} = .80$, 0% heteropolar blocks, SDM, and Length 40), the bias reaches -.12 for GA, -.17 for SA with $a$ optimization, -.25 for BF, and -.32 for SA blueprint. Across all conditions, GA consistently shows the lowest ipsativity levels, followed by SA with $a$ optimization, BF, and SA blueprint. This ordering is also reflected in the grand means: -.04 (GA), -.07 (SA with $a$ optimization), -.10 (BF), and -.15 (SA blueprint). These differences also have a large effect size (). Thus, while all methods show increasing bias as the conditions become more demanding, GA systematically yields the least distorted correlation matrices.
Empirical Illustration
We applied GA, SA, both the blueprint method and with a scale parameter, , optimization, and BF to an empirical dataset consisting of 286 Likert-type mixed item bank (145 positively keyed) from the IPIP-NEO (Johnson, 2014). From this dataset, we drew a random sample of 1,000 U.S. participants aged between 19 and 25 years old with no missing data. The rater’s SD data for each item were obtained from Hughes et al. (2021). The correlation between the scale parameter, , and in this bank is .30 (with and ). We assembled four questionnaires for each optimization method, two with a total of 70 blocks, where one had all homopolar blocks and one with 20% heteropolar blocks (14 blocks), and two with 35 blocks, where one had all homopolar blocks and one with 20% heteropolar blocks (7 blocks). All questionnaires incorporated SDM using a 0.75 cutoff for homopolar blocks and 1.125 for heteropolar blocks, since SD ratings ranged from 1 to 7. Likert item parameters were obtained with a graded response model. For the estimation of marginal reliability used in the BF and GA methods, we used the empirical NEO-PI-R factor correlation matrix (Costa & McCrae, 1992). For algorithm comparison, we considered empirical reliability estimates. The procedure was as follows. First, 5,000 responses to the FCQ were simulated using the and parameters derived from estimates based on the Likert-format version, along with the estimated trait correlation matrix. Next, the MUPP-2PL model was fitted to the simulated data. Finally, empirical reliability was calculated using the empirical_rxx() function from the R package mirt.
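One reading consistent with these values is that the simulation cutoffs, defined on a 5-point scale, were rescaled in proportion to the range of the 7-point rating scale. Under that assumption, which is an interpretation and not something stated explicitly in the text, the arithmetic is:

```latex
\[
c_{7} = c_{5} \times \frac{7 - 1}{5 - 1} = 1.5\, c_{5},
\qquad
0.50 \times 1.5 = 0.75 \ \text{(homopolar)},
\qquad
0.75 \times 1.5 = 1.125 \ \text{(heteropolar)}.
\]
```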
As seen in Table 3, GA consistently provides the most reliable questionnaire scores, followed by SA with optimization, which sometimes performs the same as BF, and SA blueprint. Differences were smaller when assembling longer questionnaires.
Table 3
Average Trait Recovery Using Each Algorithm in the Empirical Illustration
| Condition | GA | BF | SAbp | |
|---|---|---|---|---|
| 70-blocks | ||||
| 20% heteropolar | .78 | .77 | .76 | .74 |
| 0% heteropolar | .79 | .76 | .75 | .74 |
| 35-blocks | ||||
| 20% heteropolar | .71 | .65 | .64 | .61 |
| 0% heteropolar | .70 | .65 | .63 | .59 |
Note. Maximum values are marked in bold. GA = genetic algorithm; BF = brute-force; SAbp = simulated annealing blueprint. The remaining method is simulated annealing with scale-parameter optimization.
Discussion
The optimal assembly of non-cognitive questionnaires with adequate reliability has become increasingly relevant with the growing use of FC formats. Despite its importance, no previous study has systematically compared the performance of existing methods for the optimal assembly of FCQs. The present study addresses this gap by evaluating, through a simulation study and an empirical illustration, the impact of four assembly methods on trait score recovery: GA (Kreitchmann et al., 2022), with an improvement in the application of SD constraints; two SA approaches implemented in the autoFC package (Li et al., 2022), one using a blueprint method and the other optimizing the within-block difference in the scale parameters; and a BF random search. Two key aspects identified in the literature as critical for assembly are the use of heteropolar blocks (often recommended to avoid ipsativity issues) and the correlation between the scale parameter and SD, which can make optimal pairing difficult (i.e., pairing items with different scale parameters while keeping them matched in SD; Lee et al., 2022; Li et al., 2025; Pavlov et al., 2021). These factors, along with questionnaire length, were incorporated into the simulation design and the empirical illustration.
As we hypothesized, among the methods tested, GA (Kreitchmann et al., 2022) systematically produced the most reliable questionnaire scores in both the simulation study and the empirical illustration. This result can be expected, as this method explicitly optimizes the reliability of the assembled blocks. SA with scale-parameter optimization follows closely behind, which aligns with the promising results reported by Li et al. (2025). In terms of implementation, GA is more computationally demanding and slower, whereas SA is faster. The SA blueprint method performed similarly to or worse than the BF approach in the simulation study. This result is reasonable, as BF selects the questionnaire with the highest expected score reliability from the 100 generated, whereas the SA blueprint method does not include an explicit reliability optimization step. All examined factors significantly affected reliability, including questionnaire length, SDM, the degree of relation between the scale parameter and SD, and the inclusion of heteropolar blocks. As expected, longer questionnaires yielded higher reliability, and the impact of the assembly algorithm became more pronounced with increased length. GA showed the greatest advantage under the most demanding conditions. The results for the ipsativity indicator (i.e., bias in the recovery of the correlation matrix) preserve the same ordering of methods as observed for reliability. Ipsativity bias tends to increase when heteropolar blocks are absent, but its magnitude depends on the scenario. In many conditions, including some with 0% heteropolar blocks, bias remains minimal across algorithms; it reaches larger values only under more demanding scenarios, such as high correlations between scale parameters and social desirability, shorter tests, or less optimal assembly algorithms.
Across all conditions, GA consistently shows the lowest bias, followed by SA with optimization, BF, and SA blueprint, highlighting that both test design and algorithm choice can influence the magnitude of distortion. Comparing the algorithms on this indicator is informative because none of them explicitly aims to minimize this bias, which makes the comparison fairer.
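The logic of the BF baseline discussed above (keep the best of a fixed number of randomly assembled questionnaires) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the pairing ignores trait membership and SDM constraints, and `toy_score` is a stand-in for the expected score reliability that the study uses as its selection criterion.

```python
import random


def random_pairing(item_ids, n_blocks, rng):
    """Draw n_blocks disjoint item pairs at random (no item is reused)."""
    chosen = rng.sample(item_ids, 2 * n_blocks)
    return [(chosen[2 * k], chosen[2 * k + 1]) for k in range(n_blocks)]


def brute_force_search(item_ids, n_blocks, score, n_candidates=100, seed=0):
    """Generate n_candidates random questionnaires and keep the best-scoring one."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        candidate = random_pairing(item_ids, n_blocks, rng)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score


def toy_score(blocks):
    """Hypothetical objective; a real application would use expected reliability."""
    return sum(abs(i - j) for i, j in blocks)


best, best_score = brute_force_search(list(range(40)), n_blocks=10, score=toy_score)
```

Because the candidates are drawn independently, the quality of the result depends only on how many are generated, whereas GA and SA reuse information from earlier candidates to guide the search.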
In the simulation study, mixed-keyed tests produced higher reliability than tests with only positively keyed items. This further emphasizes that incorporating heteropolar blocks can be useful if items are paired correctly, supporting the findings of Brown and Maydeu-Olivares (2011) and Frick et al. (2023), but contrasting with those reported by Graña et al. (2025). This did not align with our second hypothesis. It should be noted that the SDM criterion was more lenient for heteropolar blocks, allowing items with greater differences in SD to be combined, which may have contributed to the observed increase in reliability. Moreover, excluding heteropolar blocks reduces the search space. In the empirical example, this exclusion results in disregarding 12,392 item combinations. Reducing the search space also limits the flexibility to maximize reliability. Applied researchers who consider heteropolar blocks beneficial can include them, as even a few such blocks can increase reliability. However, this can have a downside, since heteropolar blocks are harder to form, especially when other constraints such as SDM are imposed, as seen in the empirical illustration, where a smaller bank results in fewer viable heteropolar blocks. Such restrictions are particularly problematic in high-stakes contexts where item leakage may further reduce usable item combinations.
In line with our third hypothesis, the relationship between item scale parameters and SD also proved to be relevant. In banks consisting only of positively keyed items, lower correlations between the scale parameter and SD led to higher reliability. In contrast, high correlations reduced reliability; this is more noticeable when matching for SD, likely because highly discriminating items were also highly socially desirable, making it harder to assemble blocks with high scale parameters and minimal SD differences. These results indicate that the more difficult the SDM task is, the more the choice of method matters, with GA showing the best performance.
Regarding our fourth hypothesis, SDM tends to reduce reliability. We expected this because, as an additional constraint, SDM restricts the blocks that can be formed regardless of the algorithm. This effect is most pronounced in BF, whereas in GA and SA the loss in reliability is smaller. The decrease is typically less than .01, except for the most demanding condition. Nevertheless, the difference remained small, and SDM remains important in practice, as it can be easily implemented and can enhance validity.
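As an illustration of how an SDM constraint of this kind can be checked for a candidate pair, the following minimal sketch uses the cutoffs from the empirical illustration (0.75 for homopolar and 1.125 for heteropolar blocks). It assumes that the cutoff is an inclusive upper bound on the absolute within-block difference in SD ratings and that item keying is coded as +1 or -1; the function name and both conventions are hypothetical and are not taken from the GA or autoFC implementations.

```python
def satisfies_sdm(sd_i, sd_j, key_i, key_j, homo_cutoff=0.75, hetero_cutoff=1.125):
    """Return True if a candidate pair meets the assumed SD-matching rule.

    sd_i, sd_j: SD ratings of the two items (1-7 scale in the illustration).
    key_i, key_j: item keying, coded +1 (positive) or -1 (negative).
    """
    heteropolar = key_i != key_j  # items keyed in opposite directions
    cutoff = hetero_cutoff if heteropolar else homo_cutoff
    return abs(sd_i - sd_j) <= cutoff


# A one-point SD difference is rejected for a homopolar pair
# but accepted for a heteropolar pair under the more lenient cutoff.
homopolar_ok = satisfies_sdm(4.0, 5.0, +1, +1)
heteropolar_ok = satisfies_sdm(4.0, 5.0, +1, -1)
```

Every pair rejected by such a filter is removed from the search space, which is why the constraint can lower the reliability attainable by any assembly method.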
While the findings are promising, certain limitations must be considered and may inform future research efforts. One limitation of this study is its focus on pairwise blocks, which leaves the evaluation of these assembly methods for larger block sizes to future research. This constraint also applies to GA, which currently does not support the formation of blocks containing more than two items. In this study, we rely on the assumption of measurement invariance, with supporting evidence provided by Morillo et al. (2019). Nevertheless, prior research acknowledges that this assumption may not hold universally (Kreitchmann et al., 2022) and may be influenced by block polarity (Graña et al., 2025). Accordingly, the reliance on measurement invariance represents a limitation of the current study. As any method that assembles blocks from SS item parameters inherently relies on this assumption, and because measurement invariance remains a critical issue, further empirical work is needed to delineate the cases where it may not hold. It may be valuable to explore alternative approaches, such as psychometric networks, to examine this assumption in more depth (Abdelhamid et al., 2024; Jamison et al., 2024). Applied researchers interested in the algorithms examined in this study should consider this when assembling FCQs. Here, we focus on the MUPP-2PL model, although the proposed procedures can in principle be extended to the TIRT framework. The choice of model may have some impact, because even though MUPP-2PL and TIRT are nearly equivalent for binary FC data, they posit different response processes and may therefore yield slightly different block-level parameter estimates. Nevertheless, as our interest lies in the comparative performance of the block-assembly algorithms rather than in the absolute values of the parameters, the overall pattern of results is not expected to change, and the main conclusions should generalize to TIRT-based applications.
Another potential direction for future research is to extend the GA approach by incorporating alternative reliability metrics beyond marginal reliability as the objective function. For instance, GA could be adapted to prioritize the reliability of a specific trait, to apply weighted importance to certain traits over others, or to ensure a minimum level of reliability for all traits, rather than simply maximizing the average reliability across all traits. Additionally, an interesting direction for future research would be the incorporation of reliability in the optimization of energy in the autoFC package. In this regard, a normative-order indicator was used here as a measure of performance (reliability and similarity between estimated and true theta), although other classification-related metrics may be of interest in applied settings. These other metrics can be investigated in future research. In line with this further exploration and with the goal of supporting practical use, the R functions used in this study have been made available in an OSF repository (Sorrel et al., 2026). This will facilitate their application, especially for potential users of GA, who previously had to rely on the Shiny app. Future research could also examine whether other options for block assembly, such as linear programming or ant colony algorithms, offer better performance under certain conditions.
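The alternative objective functions mentioned above can be made concrete with a short sketch. The function names, the weights, and the reliability values are hypothetical and serve only to show that the same two candidate questionnaires can be ranked differently depending on the objective; nothing here reflects the current GA implementation.

```python
def mean_reliability(rel):
    """Average reliability across traits (the criterion described as the current default)."""
    return sum(rel) / len(rel)


def weighted_reliability(rel, weights):
    """Weighted average, giving some traits more importance than others."""
    return sum(r * w for r, w in zip(rel, weights)) / sum(weights)


def min_reliability(rel):
    """Reliability of the worst-measured trait (maximin criterion)."""
    return min(rel)


# Hypothetical per-trait reliabilities of two candidate questionnaires.
candidate_a = [0.85, 0.80, 0.55]
candidate_b = [0.74, 0.72, 0.70]

prefers_a_on_average = mean_reliability(candidate_a) > mean_reliability(candidate_b)
prefers_b_on_minimum = min_reliability(candidate_b) > min_reliability(candidate_a)
```

In this toy example, the average criterion favors the first candidate, while the minimum criterion favors the second because no trait falls below .70.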
Conclusion
In conclusion, we recommend using the GA for assembling FCQs, as it consistently produces high-quality solutions by accounting for key psychometric properties. Its advantages are particularly evident in challenging scenarios such as short questionnaires, a high correlation between the scale parameter and SD, and the need to match items on SD. Although SDM may slightly reduce reliability due to its restrictive nature, it remains essential for minimizing response bias. As the forced-choice format continues to gain popularity over traditional SS formats, the use of optimization algorithms becomes increasingly important. These methods enable fast and reliable questionnaire assembly while accommodating additional constraints such as SD control and the inclusion of heteropolar blocks, making them an essential tool for advancing non-cognitive assessment.
This is an open access article distributed under the terms of the