Empirical Ensemble Equating Under the NEAT Design Inspired by Machine Learning Ideology

This study proposes an empirical ensemble equating (3E) approach that collectively selects, adopts, weighs, and combines outputs from different sources to take advantage of different equating techniques in various score intervals. The ensemble idea was demonstrated and tailored to equating under the Non-Equivalent groups with Anchor Test (NEAT) design. A simulation study based on several published settings was conducted. Three outcome measures – average bias, its absolute value, and root mean square difference – were used to evaluate the selected methods' performance. The 3E approach outperformed its counterparts in most of the given conditions; cautions regarding the use of the proposed approach, such as tuning weights and assuming plausible scenarios, are also addressed.

mechanisms and item bank construction. From a measurement perspective, different assessment forms should be built on an identical set of content and statistical specifications for consistency purposes. Further, statistical models are adopted to support the exchangeability of scores across forms; this process is generally called equating, and it allows scores to be projected from one form onto the other.
Among many equating designs, the Non-Equivalent groups with Anchor Test (NEAT) design is a highly, if not the most, popular one widely adopted in research and practice. In an application of the NEAT design, a new form x is equated to an old form y: a sample from Group X takes form x, and a sample from Group Y takes form y. In addition, an anchor test is taken by both groups and allows one to study the difference in ability between Group X and Group Y. Group X's true responses on form y and Group Y's true responses on form x are not observable, as they never occur in the actual administration; this makes the quality evaluation of equating difficult, as no true values are available for comparison. Therefore, most studies investigating the performance of equating methods are simulation-based (e.g., Andersson & Wiberg, 2017; Moses & Holland, 2010; Sinharay & Holland, 2010). That is, researchers specify empirical conditions to determine whether specific methods yield better results than their counterparts; the findings are then used to assist method selection.
Statistical techniques for equating concern transformations of both modeling parameters and item responses, including those based on equipercentile equating, linear equating methods, item response theory (IRT) observed-score and true-score equating, local equating (van der Linden, 2011), the Levine nonlinear method, Kernel equating (KE), and others (see Kolen & Brennan, 2004 for details). Specifically, post-stratification equating (PSE), Levine observed-score linear equating, and chained equating (CE) are typically used in KE when the NEAT design is present (von Davier et al., 2004). However, none of these techniques consistently performs better than the others. In fact, the performance depends on the settings of the actual tasks and on different score ranges. For instance, Livingston and Kim (2009) found that differences in accuracy between equating methods were small for raw scores near the median of the distribution but large for scores far from the median, and that the circle-arc method had higher accuracy in the upper and lower tails of the score distribution compared to mean equating in small samples. Kim and Livingston (2010) showed that, in small-sample scenarios, CE produced the most reliable results for low scores, while circle-arc methods were better choices in the upper half of the score distribution.

Ensemble Learning
As a powerful technique, ensemble learning (EL) functions as its name suggests: utilizing multiple models to improve the reliability and accuracy of specific machine-learning predictions. The idea of the ensemble is to collectively select, adopt, weigh, and combine outputs from different sources. Without loss of generality, in a classification task, techniques such as logistic regression, support vector machines, random forests, and neural networks can all be set into an EL framework to improve the stability of the overall performance. Numerous studies across fields show that EL is frequently more reliable than individual models. For instance, Borovkova and Tsiamas (2019) classify different companies in the stock market via EL; Lessmann and colleagues (2021) propose an EL framework to support marketing decision-making; and Priore and colleagues (2018) construct schedules for flexible manufacturing systems using ensemble methods. Unsurprisingly, EL often ranks at the top in machine-learning competitions such as Kaggle (Kumar & Mayank, 2020; Stamp et al., 2021).
EL can be framed in multiple ways. The simplest is averaging the outputs of different models, while more complex variants devise weights and adaptive algorithms to empower the engineer. The concept of EL has also been extended in a broader sense: it is not limited to models but applies to data and hypotheses as well. In this paper, we limit EL to the context of a modeling ensemble, where each model is termed a "learner". There are two EL sub-types: sequential EL and parallel EL. The former considers the dependence between learners, each of which is exploited sequentially to obtain more accurate predictions. To illustrate with a classification example again, mislabeled cases have their weights adjusted while the weights for properly labeled cases stay unchanged; each time a new learner is generated, the weights are updated to improve the classification performance. Parallel EL, on the other hand, drives learners in parallel. The idea is to exploit the learners' independence, as the overall error rate can be reduced by drawing on "good" learners' strengths to offset "bad" learners' weaknesses. Figure 1 shows a simple EL: four learners (i.e., classification techniques such as logistic regression, support vector machine, and so on) are used to predict a binary variable with a value of red or blue, and the third one yields a different label (blue) from the others (red). If one uses majority vote as the ensemble schema, the aggregated result is red, as three learners endorse red and only one endorses blue.
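The majority-vote schema just described can be sketched in a few lines. This is an illustrative toy example in Python (the study's own analysis was implemented in R):

```python
# Toy sketch of a majority vote: four learners predict a binary label
# ("red" vs. "blue"); the label endorsed by most learners wins.
from collections import Counter

def majority_vote(labels):
    """Return the label endorsed by the largest number of learners."""
    return Counter(labels).most_common(1)[0][0]

# Three learners say "red", one says "blue" -> the ensemble says "red".
print(majority_vote(["red", "red", "blue", "red"]))  # red
```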

A Simple Ensemble Learning Using Majority Vote
It is self-evident that different weighting schemes can lead to different conclusions, even if the learners in two EL models are identical. Under the "simple weighted average (SWA)" scheme, the weights are proportional to the precisions of each learner. The "weight proportional to the square of the precision (SqrWA)" scheme squares the precisions to obtain weights. The "weight proportional to the precision's power of N (PrWA)" scheme further extends the square to an arbitrary integer power N. Other schemes, such as ones considering data collection time (i.e., the "age" of the data) or polynomial functions of the data variance, are also available but not applicable to the present study (see Wagner, 1975, p. 289). Consider a situation where the precisions take values larger than 1 (the inverse effect appears when the precisions are expressed as ratios or percentages); the three weighting schemes (SWA, SqrWA, and PrWA) then place increasingly more trust in the learners that perform best on a specific estimate. For example, the same precision will receive a larger weight in SqrWA than in SWA. An extreme choice is simply picking the best learner and neglecting the others; that is, all non-optimal learners receive weights of zero and the optimal one receives 100% when calculating weights.

EL has been applied to different areas and inquiries in educational and psychological studies. Ragab and colleagues (2021) use EL algorithms to predict student failure and enable customized educational paths; Abidi and colleagues (2020) adopt ensemble classifiers to quantify academic procrastination through big data assimilation; Premalatha and Sujatha (2021) predict the employment status of graduates in higher educational institutions via EL; and Pearson and colleagues (2019) estimate treatment outcomes following an internet intervention for depression through a machine-learning ensemble.
These successful applications lie primarily in prediction and classification; engrafting EL onto equating tasks remains unexplored, making the topic both practically beneficial and methodologically meaningful to the field.

Method
The method section outlines the steps involved in constructing the proposed ensemble approach and highlights the rationales and consequences of this method through a walkthrough case study. This case study uses the scenario depicted in Figure 2 for illustration purposes. The first part of Figure 2 displays the true scores ranging from 20 to 23, along with the estimated scores generated by three equating models (referred to as learners in this study) represented by M1, M2, and M3. The second and third parts of Figure 2 present the biases (i.e., the equating result minus the true score) and their absolute values, along with their averaged values highlighted in the last row. It can be observed that the lowest values of the averaged bias and absolute bias are 0.15 and 0.25, respectively (see the last row in the second and third part of Figure 2). Thus, a better approach would ideally produce values lower than these two numbers.

The Steps of a Walkthrough Case Using Simple Weighted Average Schema
The proposed approach employs a "simple weighted average" schema and requires absolute bias values to generate comparable measures, enabling the calculation of relative contributions to ensemble weights. Theoretically, models with smaller absolute biases should be trusted more and given greater weight in the final ensemble. Thus, contributions to the weights should be inversely related to absolute biases. In the fourth part of Figure 2, the inversions of the absolute biases (e.g., 1.0/0.5, 1.0/0.3, and 1.0/0.4 in the first row) are calculated and summed across each row. These inversions, as shown in the fourth part of Figure 2, are divided by their sums for each row (e.g., 2.0/7.8, 3.3/7.8, and 2.5/7.8 in the first row) to create ensemble weights, which are listed in the fifth part of Figure 2. Consequently, the sum of the weights in each row equals 1, as shown in the last column.
Finally, the weights are applied to the corresponding estimated scores presented in the first part of Figure 2. The ensemble equating is completed by summing the weighted scores, resulting in the last column in the sixth part of Figure 2. It is straightforward to calculate that the average bias and absolute bias values of the ensemble score, as seen in Figure 2 (across the case's 20-23 range), are 0.05 and 0.15. As expected, these aggregated accuracy measures are both lower than those of any individual model, as shown in the last row of the second and third parts of Figure 2. This indicates that the ensemble approach effectively improves the equating accuracy compared to relying on a single model, thus validating the proposed method.
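The row-wise arithmetic of this walkthrough can be sketched as follows. The absolute biases match the first row of Figure 2 (0.5, 0.3, 0.4, whose inversions are 1.0/0.5, 1.0/0.3, and 1.0/0.4), but the learner estimates are hypothetical stand-ins, since Figure 2's exact estimated scores are not reproduced here; the study's own code was written in R.

```python
def ensemble_row(estimates, abs_biases):
    """Simple-weighted-average ensemble for one score point:
    weights are the normalized inverses of each learner's absolute bias."""
    inverted = [1.0 / b for b in abs_biases]
    total = sum(inverted)
    weights = [v / total for v in inverted]  # weights sum to 1 per row
    return sum(w * e for w, e in zip(weights, estimates))

# Assume a true score of 20 with hypothetical learner estimates
# 20.5, 19.7, and 20.4 (absolute biases 0.5, 0.3, and 0.4).
score = ensemble_row([20.5, 19.7, 20.4], [0.5, 0.3, 0.4])
print(round(score, 3))  # 20.128
```

The resulting ensemble bias (about 0.128) is smaller than any individual learner's absolute bias in this hypothetical row, mirroring the improvement the walkthrough reports.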
The walkthrough case in Figure 2 shows only one scenario, whose result is not comprehensive enough to generalize from. Figure 3 contains two more scenarios: the first further amplifies the advantage, as all the aggregated absolute bias values for the three learners (0.325, 0.725, and 0.75) are larger than that of the ensemble score (0.26); the second, although not producing the optimal result (0.56) when compared with the individual learners (0.5, 0.75, and 0.75), remains robust as it outperforms most of its counterparts.
As introduced above, different weighting schemes are likely to result in different estimates. If one uses the "weight proportional to the square of the value/precision (SqrWA)" scheme, the white cells in the fourth part of Figure 2 should be squared before summing, and the rest of the calculations follow the same flow. The PrWA_N scheme raises the inverted absolute bias values in the fourth part of Figure 2 to the Nth power to increase the impact of top-performing learners in the ensemble procedure.
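Under this inversion-then-normalize flow, SqrWA and PrWA_N change only the exponent applied before summing. A minimal sketch (the absolute biases echo Figure 2's first row; the study itself used R):

```python
def ensemble_weights(abs_biases, power=1):
    """Normalized weights from inverted absolute biases.
    power=1 gives SWA, power=2 gives SqrWA, power=N gives PrWA_N."""
    raised = [(1.0 / b) ** power for b in abs_biases]
    total = sum(raised)
    return [r / total for r in raised]

# Raising the power concentrates weight on the best-performing learner.
print(ensemble_weights([0.5, 0.3, 0.4], power=1))   # roughly [0.26, 0.43, 0.32]
print(ensemble_weights([0.5, 0.3, 0.4], power=50))  # nearly [0, 1, 0]
```

At very high powers the scheme approaches the extreme "pick the best learner only" choice mentioned earlier.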
In practice, however, true scores are unknown in an equating setting. Therefore, constructing an ensemble equating model demands a mechanism to account for the plausible variability in observed responses; that is, this mechanism should deliver weights for each learner. Based on the logic of the walkthrough case, we propose an empirical ensemble equating (3E) approach to handle the NEAT design's inquiry.
Like power analysis in complex scenarios where mathematical derivation fails to provide viable solutions, the 3E approach is simulation-based and rooted in empirical estimates from item response theory (IRT). Let {x, y, anchor} be observed responses from a NEAT design and, correspondingly, β and θ be estimates of the item parameters and latent traits obtained by fitting an IRT model (e.g., the three-parameter logistic model) to the observed responses. We adopt the well-known "KBneat" dataset to demonstrate the 3E approach. This dataset contains responses to two forms (one for each group) of a 36-item NEAT-based examination, of which 12 anchor items were taken by both groups (Kolen & Brennan, 2004).
In this study, eight learners were used both for comparative purposes and for 3E construction: linear equating methods (Tucker linear equating and chained linear equating), equipercentile equating methods (equipercentile equating using the frequency estimation method with log-linear smoothing and equipercentile equating using the chained method with log-linear smoothing), mean equating methods (Tucker mean equating and chained mean equating), and circle-arc methods (Tucker circle-arc equating and chained circle-arc equating). These learners have been studied extensively and proven useful in many environments. Employing a diverse set of learners allows the ensemble approach to benefit from collective wisdom, as it can potentially reduce the impact of any shortcomings associated with a single method. By combining the outputs of these learners, the ensemble method can achieve better overall performance, ultimately improving the accuracy and robustness of the equating process.
Each learner was applied to "KBneat", and the equated results were standardized to estimate the ability difference between the two groups: the mean of the standardized differences was 0.3. The "KBneat" data of both groups were calibrated via two separate three-parameter logistic (3-PL) IRT models, resulting in β_X and β_Y (see Kolen & Brennan, 2014, p. 203 for item parameter estimates). θ_X was sampled from Normal(0, 1), while θ_Y was assumed to drift from θ_X by 0.3; its distribution was therefore set to Normal(0.3, 1.5). θ_X was used with β_Y in Group Y's 3-PL model to generate Group X's true responses on form y (called x.y), so that each learner's precision could be calculated. 10,000 individuals per group were generated as the pool. In summary, these simulation-based steps functioned as the basis for calculating the learners' weights.
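The generation of x.y can be sketched as below. This is a Python illustration with hypothetical item parameters, not the KBneat β_Y estimates, and the study's own pipeline was implemented in R.

```python
import numpy as np

rng = np.random.default_rng(2023)

def p_3pl(theta, a, b, c):
    """3-PL probability of a correct response for each person-item pair."""
    return c + (1.0 - c) / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))

n_persons, n_items = 10_000, 36
theta_x = rng.normal(0.0, 1.0, n_persons)   # Group X abilities ~ Normal(0, 1)
a = rng.lognormal(0.0, 0.3, n_items)        # discriminations (hypothetical)
b = rng.normal(0.0, 1.0, n_items)           # difficulties (hypothetical)
c = rng.uniform(0.1, 0.25, n_items)         # pseudo-guessing (hypothetical)

# In the study, theta_X is paired with beta_Y to obtain x.y; here the
# hypothetical parameters stand in for beta_Y.
prob = p_3pl(theta_x, a, b, c)
x_dot_y = (rng.random((n_persons, n_items)) < prob).astype(int)
true_scores = x_dot_y.sum(axis=1)           # criterion number-correct scores
```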
The 3E approach is simulation-based, yet evaluating its performance demands a simulation study, too. Sample size directly affects random equating error, and different equating methods suit different sample sizes. For example, when sample sizes are relatively small, circle-arc equating and mean equating might be considered (Livingston & Kim, 2009). However, Diao and Keller (2020) suggested that sample sizes between 100 and 400 are sufficient for all classical equating methods. Sample sizes larger than 1,000 (e.g., 1,000, 1,500, 2,000) have not been well investigated in previous equating research. To fully consider the impact of various sample sizes on the learners (traditional equating methods), the simulation study set the sample size per group equally to 200, 500, 1,000, 1,500, and 2,000. For example, if the sample size was 500, 500 rows of observations were randomly drawn from {x, y, anchor, x.y}. As an empirical approach, it is reasonable to assume that the item parameters remained stable. However, the ability difference could vary from cohort to cohort, especially when the sample size was small. Therefore, in addition to 0.3, two more θ drifts (0.1 and 0.5) were used. Finally, SWA, SqrWA, and PrWA (with powers set to 5, 50, and 100) were deployed as weighting schemes of the 3E approach, which adjusted the impact of top-performing learners in the ensemble procedure. Each condition was replicated 100 times.

Three measures were used according to the literature (Wolkowitz & Wright, 2019; Zeng, 1993): the average bias and its absolute value (BIAS and AbsBIAS) and the root mean square difference (RMSD):

BIAS = (1/SS) Σ_p (x.ŷ_p − x.y_p),  AbsBIAS = (1/SS) Σ_p |x.ŷ_p − x.y_p|,  RMSD = √[(1/SS) Σ_p (x.ŷ_p − x.y_p)²],

where SS was the sample size, x.ŷ_p was the equated score of individual examinee p, and x.y_p was the true score of individual p from Group X on test y, generated through the simulation procedure described previously. Based on the repeated samples, the measures were calculated by averaging over the 100 repetitions.
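The three outcome measures can be computed as below. This sketch takes AbsBIAS as the mean absolute deviation across examinees, consistent with the walkthrough's use of per-score absolute bias; the study's own code was written in R.

```python
import numpy as np

def equating_measures(equated, criterion):
    """Return BIAS, AbsBIAS, and RMSD for one replication."""
    d = np.asarray(equated, dtype=float) - np.asarray(criterion, dtype=float)
    bias = d.mean()                  # average signed error
    abs_bias = np.abs(d).mean()      # average absolute error
    rmsd = np.sqrt((d ** 2).mean())  # root mean square difference
    return bias, abs_bias, rmsd

# Hypothetical equated vs. criterion scores for three examinees.
bias, abs_bias, rmsd = equating_measures([21.2, 19.5, 20.1], [21.0, 20.0, 20.0])
```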
The analysis was implemented using R (Version 4.2.2 64-bit; R Core Team, 2016) and the R code is given in the Supplementary Materials.

Analysis
Since this study aims to explore the performance of the ensemble method and compare it with traditional methods, rather than comparing among the traditional methods (learners) themselves, in each repetition we chose the learner with the highest equating accuracy as a reference. The reference method therefore does not correspond to a specific equating method; it may differ across conditions and repetitions, but it is always the best learner, and the ensemble method is always compared with it. The average equating errors for the reference method and the five 3E approaches utilizing various weighting schemes across different conditions are displayed in Tables 1, 2, and 3. The reported reference values represent the minimum equating error (i.e., the smallest absolute BIAS, RMSD, and BIAS values) among the eight learners in each repetition, averaged over all repetitions. These bias measures gauge equating accuracy, with smaller values indicating higher precision in the equating process.

As shown in Table 1, as the power in the weighting scheme for the 3E approaches increased, the absolute BIAS value decreased and the equating accuracy rose. This trend weakens once the power reaches 50, after which the absolute BIAS value may no longer decrease: the differences in absolute BIAS among powers 1, 2, 5, and 50 are relatively large, whereas the values for powers 50 and 100 are fairly close. The equating accuracy of these high-power 3E approaches is higher than that of the reference method, as their absolute BIAS values are smaller. That is, in practical equating work with similar conditions, selecting a power of 50 can yield better equating performance than the reference method.
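Per repetition, the reference is simply whichever learner attains the smallest error. A minimal sketch with hypothetical error values:

```python
def reference_method(error_by_learner):
    """Pick the learner with the smallest equating error in one repetition."""
    return min(error_by_learner, key=error_by_learner.get)

# Hypothetical absolute BIAS values for three of the eight learners.
rep = {"Tucker linear": 0.31, "chained linear": 0.27, "Tucker circle-arc": 0.35}
print(reference_method(rep))  # chained linear
```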
The absolute BIAS values for all methods increased as the ability difference between the two groups increased. It is worth noting that even though 0.3 was the preset ability difference between the two groups (that is, the weights used in the 3E approaches were calculated under that same setting), the equating deviation of all approaches was still greater than under the condition in which the ability difference was 0.1. However, the gap between the reference method and the worst-performing 3E approach decreased with increasing ability drift. Taking the sample size of 200 as an example, when the ability difference was 0.1, 0.3, and 0.5, the differences in absolute BIAS values between the SWA method and the reference method were 0.445, 0.368, and 0.203, respectively.
Regardless of the ability difference, the absolute BIAS values of the reference, PrWA_100, and PrWA_50 methods decreased as the sample size increased from 200 to 1,000. However, larger samples did not always lead to better equating when the sample size was greater than 1,000. For example, when the ability difference between the two groups was 0.1 or 0.3, the absolute BIAS values of the reference, PrWA_100, and PrWA_50 methods decreased when the sample size increased from 200 to 1,000 but increased when it rose from 1,000 to 1,500. When the ability difference was 0.5, the absolute BIAS values of the reference, PrWA_100, and PrWA_50 methods showed a downward trend with increasing sample size. That larger sample sizes do not always lead to better equating results once the sample size exceeds 1,000 may be due to a phenomenon known as the "law of diminishing returns." As the sample size increases, the estimates derived from the equating methods become more stable and closer to their true values. However, after a certain point, the estimates are already stable enough, and further increasing the sample size provides minimal additional information.
Besides, the equating methods may have inherent limitations that prevent them from achieving perfect accuracy, regardless of the sample size. In these cases, increasing the sample size may not lead to significant improvements in equating accuracy, as the limitations are related to the methods rather than the sample size. In addition, the effect of sample size on the absolute BIAS values of the PrWA_5, SqrWA, and SWA methods did not show a uniform pattern. The RMSD results present a similar pattern to the absolute BIAS values and are shown in Table 2.

Table 3 shows that regardless of which equating method was used, the BIAS value increased as the ability difference between the two groups increased. The greater the ability difference between the two groups and the smaller the sample size, the more pronounced the advantage of the 3E approaches. When the ability difference was 0.1, the BIAS values of the PrWA_100 and PrWA_50 methods were smaller than that of the reference method only when the sample size was 200; when the ability difference was 0.3, this held for sample sizes of 200, 500, and 1,000; and when the ability difference was 0.5, the BIAS values of the PrWA_100 and PrWA_50 methods were smaller than that of the reference method regardless of the sample size. Among the five 3E approaches, the BIAS values tended to decrease as the powers of the precision used in the weighting schemas increased. Still, the values may not continue to fall once the power increases to 50, especially when the ability difference between the two groups is large.
The sample size had no uniform effect on the reference method, and for the 3E approaches, the BIAS values at a sample size of 2,000 were always greater than those at a sample size of 200. However, this does not mean the BIAS value consistently increased with the sample size. For example, the BIAS values of all the compared 3E approaches under the 1,000-sample-size condition were smaller than those under the 500-sample-size condition.
In summary, in terms of equating accuracy, the higher the power of the precision used in the weighting schemas for the 3E approaches, the better the performance; a power of 50 is enough to outperform the eight reference learners in most cases. In addition, the greater the ability difference between the two groups and the smaller the sample size, the more noticeable the advantages of the 3E approaches (i.e., the PrWA_100 and PrWA_50 methods) over the reference method.

Empirical Study
To showcase the performance of the new method in a practical setting, we present an empirical study using the final examination scores of fifth-year undergraduate students from a medical school. The surgery exam consisted of two rounds, corresponding to two tests, each containing 40 dichotomously scored items, with 12 common items between them. A total of 201 students were randomly assigned to the two tests. The descriptive information of raw scores can be found in Table 4. The eight equating methods used in the simulation study, along with the PrWA_50 method, were employed to equate the scores of the first round to those of the second round. The dataset and the code used can be obtained by contacting the corresponding author. The equating results are displayed in Figure 4. Since no student's total score was below 10, the figure only shows the raw scores from 10 to 40. Although the true equating transformation is not known in the real dataset, it can be observed that the differences among various equating methods are not substantial. The PrWA_50 method is shown with a thicker red line, and in the score range of 10 to 30, it is closest to the results of the chained mean equating method. When the scores are above 30, it is more similar to the results of the two equipercentile equating methods. In other words, the PrWA_50 method can combine the results of various equating methods with different weights across different score ranges.

Figure 4
Equating Results for a Real-World Dataset

Discussion and Conclusion
EL is a powerful technique that utilizes multiple models to improve on the reliability and accuracy of individual models. This study proposed an empirical ensemble equating (3E) approach that treats multiple equating functions as learners in EL and adopts several weighting schemes to improve equating accuracy under the NEAT design. The simulation study found that the 3E approach with weights proportional to the precision's powers of 50 or 100 can yield more accurate equating results than the eight ensembled equating methods in most cases. The 3E equating approach proposed in this study can better support the exchangeability of scores across different forms of an assessment, thereby safeguarding the fairness of the assessment and providing support for constructing item banks more scientifically. Holland and Strawderman (2009) introduced several approaches to averaging two or more equating functions, provided details on how to weigh the equating parameters, and discussed some properties of the averages of equating functions. The traditional methods of averaging equating functions introduced by Holland and Strawderman (2009) (e.g., the point-wise weighted average method, the angle bisector method, and the symmetric weighted average method) can be seen as following the ensemble idea but are limited to linear equating functions or two nonlinear equating functions. The 3E approach proposed in this study can ensemble various linear or nonlinear equating methods and adopt various weighting schemes, which is more generalized and flexible, as the 3E approach eventually returns an equated score sheet rather than a model.
Another way to understand the utility of the proposed 3E approach is to compare it to prior sensitivity analysis in Bayesian analysis. From the Bayesian perspective, ideal priors should accurately reflect preexisting knowledge of the world, both in terms of the facts and the uncertainty about those facts. Priors that do not correspond to reality, however, can lead to severe bias (e.g., Baldwin & Fellingham, 2013; van Erp et al., 2019). Thus, a prior sensitivity analysis, which aims to update one's prior beliefs with the data, is vital for improving the performance of Bayesian modeling. Compared with typical prior sensitivity analysis, improving equating performance by proposing multiple equating designs is scarce in equating studies. There are commonalities between the proposed ensemble learning method and Bayesian prior sensitivity analysis. First, both utilize multiple methods (frequentist perspective) or prior settings (Bayesian perspective) to improve prediction accuracy. The selection process can be arbitrary or purposeful in both frameworks, depending on the research purpose. For instance, a researcher may include the chained linear equating method as one learner because he or she, as an equating expert, believes the method is appropriate. Previous Bayesian literature has shown that some models perform better than frequentist models when priors reflect researchers' beliefs appropriately (van Erp et al., 2019). Second, the selection of those sources reflects different hypotheses (frequentist perspective) or beliefs (Bayesian perspective). In the simulation study, eight learners were chosen. Like informative and uninformative priors, each learner in ensemble learning reflects strong or weak hypotheses and thus performs better than the others in specific scenarios. Third, some method is needed to summarize the different outputs and obtain the best result.
This study shows that EL is very flexible in adopting different averaging schemas. Similarly, in Bayesian analysis, the prior setting can be updated using information from previous prior settings to achieve better performance.
To develop further from this initial attempt, research can extend in the directions of setting comparison and monotonicity constraints. The former relates to the assumptions behind the empirical settings of the 3E simulation scenarios: are large-scale assessments more appropriate than smaller ones, given that their estimated parameters are more stable and trackable due to their standardized nature? The second problem is that the equated scores are not guaranteed to follow a strictly ascending or descending order: can smoothing functions be used to ease monotonicity violations? Thirdly, only the absolute bias has been used to calculate the weights; it might be beneficial to consider both the equating bias and the standard deviation when determining the weights for the ensemble approach. Finally, it is essential to remember that the 3E approach relies heavily on the learners' quality. Even though contamination by bad learners can be well controlled within the 3E framework, a relatively large number of bad learners can still be detrimental to the final equated estimates.