In social sciences, Structural Equation Modelling (SEM; Bollen, 1989; Hoyle, 2012) is widely used to investigate relations between constructs (e.g., emotions, motivation), referred to as “structural relations” within SEM. Researchers are often interested in how these structural relations vary across groups. For instance, Michael and Kyriakides (2023) examined how academic motivation mediated the effect of socioeconomic status on reading achievement among 15-year-old students and how this differed across 38 countries.
To study differences in structural relations, Multigroup SEM (MG-SEM) and Multilevel SEM (ML-SEM) can be used. MG-SEM estimates the structural relations for each group and allows testing whether they are equal across groups. ML-SEM captures variations in structural relations by normally distributed random effects around the overall mean estimate for each relation. Even though group-specific estimates of relations can be derived from random effects, only the mean and variance of each random effect are part of the model parameters, which makes ML-SEM more parsimonious, allowing for accurate parameter estimates in case of very small sample sizes per group. To pinpoint which groups have the same relations and for which groups they differ, MG-SEM and ML-SEM require pairwise comparisons of group-specific relations. As the number of groups increases, performing pairwise comparisons quickly becomes infeasible. For example, for 38 groups, this requires 703 pairwise comparisons per structural relation. To reduce the number of comparisons, mixture modeling (McLachlan & Peel, 2000) can be used to cluster groups based on similarity of the structural relations. Before performing such a clustering, it is essential to ensure that the structural relations are validly comparable across groups and that they are the only source of differences driving the clustering.
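The quadratic growth of the number of pairwise comparisons can be verified with a quick check (our own illustration, not part of the original method): with $G$ groups, each structural relation requires $G(G-1)/2$ comparisons.

```python
from math import comb

def n_pairwise(G: int) -> int:
    # Number of pairwise comparisons among G groups: G * (G - 1) / 2
    return comb(G, 2)

print(n_pairwise(38))  # 703 comparisons per structural relation
```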
In social sciences, the constructs of interest are typically unobserved or latent variables, also known as “factors” in SEM. SEM addresses their latent nature by including a measurement model (MM), which specifies how latent variables are measured by observed indicators (often questionnaire items), whereas the relations of interest among the latent variables are part of the structural model (SM). For valid comparisons of constructs and their relations, measurement invariance (MI) must hold across the groups. MI implies that the MM is equal across groups, meaning that the constructs are measured in the same way, so that observed differences reflect differences in the constructs rather than differences in measurement.
MI is examined at different levels by assessing the equality of different subsets of MM parameters. Configural invariance evaluates whether the factor structure is the same across groups, meaning that, in each group, the same set of indicators relates to a factor. The strength and direction of the relations between factors and indicators are quantified by factor loadings. Whereas configural invariance only deals with which factor loadings are non-zero, weak or metric invariance requires the loadings to be equal across groups. Next, strong and strict invariance impose equality of the items’ intercepts and residual or ‘unique’ variances, respectively. Metric invariance is a prerequisite for validly comparing structural relations, whereas strong and strict invariance are not required. When full metric invariance (i.e., invariance of all loadings) does not hold, partial metric invariance (i.e., invariance of some loadings) still enables valid comparisons of structural relations (Byrne et al., 1989), as long as the loading differences are captured in the model (e.g., by group-specific loadings). The same holds for differences in item intercepts and unique variances.
When combining SEM with mixture modeling, groups can be clustered on their structural relations by making the structural relations cluster-specific (i.e., the same for all groups assigned to a cluster). In traditional mixture SEM methods (Arminger & Stein, 1997; Dolan & van der Maas, 1998; Jedidi et al., 1997), MM parameters can be specified as invariant or cluster-specific, implying that MM differences can either be ignored or captured by the same clustering. To cluster groups only on the structural relations rather than also on differences in measurement, a framework of novel mixture SEM methods has recently emerged. Perez Alonso and colleagues (2024) introduced Mixture Multigroup SEM (MixMG-SEM), which combines MG-SEM with mixture modeling. Zhao and colleagues (2025a) proposed Mixture Multilevel SEM (MixML-SEM), which builds the mixture clustering onto the more parsimonious ML-SEM. The aim of both methods is to cluster groups specifically on the structural relations while accounting for measurement non-invariance, but the difference is that MixML-SEM uses Multilevel Confirmatory Factor Analysis (ML-CFA) with random effects to deal with MM differences, whereas MixMG-SEM uses Multigroup Confirmatory Factor Analysis (MG-CFA) with group-specific MM parameters. Their estimation builds on the stepwise “Structural-After-Measurement” (SAM; Rosseel & Loh, 2024) approach, where the MM is estimated first, using either MG-CFA or ML-CFA, followed by the SM, which includes clustering the groups on their structural relations. For comparability of structural relations, both methods require at least partial metric invariance and impose exact equality for the invariant factor loadings (i.e., exact MI). However, with a large number of groups, achieving exact MI is often unrealistic.
To address this, Multigroup Bayesian SEM (MG-BSEM; Muthén & Asparouhov, 2012, 2013a, 2013b) with Approximate MI (AMI) uses priors with small variances for the MM parameters to allow for small differences across groups while keeping them approximately equal. In this paper, we present Mixture Multigroup BSEM (MixMG-BSEM), which extends MG-BSEM with mixture modeling to cluster groups on the structural relations while capturing approximate invariance of factor loadings.
MixMG-BSEM, MixMG-SEM and MixML-SEM differ in their first estimation step only, that is, in their MM and the corresponding MI assumptions. MixMG-SEM and MixML-SEM require exact invariance for (at least) some loadings, whereas the first step of MixMG-BSEM is a MG-CFA with Bayesian estimation (MG-BCFA) that assumes approximately invariant loadings. Approximate invariance lies between exact invariance (where parameters are exactly equal across groups) and non-invariance (where parameters can differ substantially across groups), where exact invariance is more closely approximated as the variances of the priors become smaller.
The paper is structured as follows: We begin with a description of MixMG-BSEM in the Method section. Next, we evaluate its performance through a Simulation Study. Finally, the Discussion section summarizes the main findings and addresses limitations and future directions.
Method
As mentioned above, MixMG-BSEM is estimated in a stepwise manner, building on the SAM approach (see Figure 1). In Step 1, MG-BCFA with small-variance priors is performed for each factor, and factor scores are extracted. In Step 2, these factor scores are used as single indicators to obtain group-specific factor covariances with Croon’s correction (Croon, 2002). In Step 3, the SM is estimated, including the clustering and the cluster-specific structural relations, using an Expectation-Maximization (EM; Dempster et al., 1977) algorithm for maximum likelihood estimation. Note that Steps 2 and 3 are the same as for MixML-SEM and are therefore only briefly described below (for details, see Zhao et al., 2025a).
Figure 1
Mixture Multigroup Bayesian SEM with Approximate Measurement Invariance
Step 1: Measurement Model With Bayesian Approximate Measurement Invariance
The MM defines how the factors are measured by the items, and MG-CFA is used to compare MMs across groups. Note that we estimate the MM per factor, which implies that we assume the factors to be independent in Step 1. Indicating an individual in Group $g$ ($g = 1, \ldots, G$) by $i$ and gathering the responses on the $J_q$ items measuring Factor $q$ ($q = 1, \ldots, Q$) in the vector $\mathbf{x}_{qig}$, the MG-CFA model for Factor $q$ is expressed as:

$$\mathbf{x}_{qig} = \boldsymbol{\tau}_{qg} + \boldsymbol{\lambda}_{qg}\,\eta_{qig} + \boldsymbol{\varepsilon}_{qig}, \qquad \boldsymbol{\varepsilon}_{qig} \sim N(\mathbf{0}, \boldsymbol{\Theta}_{qg}) \tag{1}$$

where $\boldsymbol{\tau}_{qg}$ is a $J_q$-dimensional vector of intercepts for Group $g$, $\boldsymbol{\lambda}_{qg}$ is a $J_q$-dimensional vector of factor loadings (i.e., item-factor relations) for Group $g$, $\eta_{qig}$ denotes the latent variable score for individual $i$, and $\boldsymbol{\varepsilon}_{qig}$ is a $J_q$-dimensional vector of residuals, with the diagonal of $\boldsymbol{\Theta}_{qg}$ containing the group-specific unique variances of the items. To set the scale of each factor, one can either set its variance to one or use the marker variable approach by fixing one loading (ideally, a strong and invariant loading) to one, for each group. In this paper, we adopt the marker variable approach to ensure that a one-unit change in the underlying factor has the same meaning across groups.
Since small differences in MM parameters are common across many groups and still allow for latent variable comparisons, we apply the assumption of approximate metric invariance (i.e., $\boldsymbol{\lambda}_{qg} \approx \boldsymbol{\lambda}_{q}$ for all Groups $g$) instead of exact invariance (i.e., $\boldsymbol{\lambda}_{qg} = \boldsymbol{\lambda}_{q}$) in Step 1 of MixMG-BSEM. This is accomplished by using MG-CFA with Bayesian estimation1 (MG-BCFA) and applying small-variance, normally distributed priors to the corresponding factor loadings, which constrain the group-specific loadings to be approximately equal (Muthén & Asparouhov, 2012). For this, both Mplus (Muthén & Muthén, 1998) and the R package blavaan (Merkle et al., 2021) are available, but we use blavaan by default because it is free and open-source. In blavaan, AMI is achieved by applying small-variance priors in every group except for the reference group, which is the first group by default. A non-informative prior is used for a given loading in the first group, and the parameter estimate for this group is used as the mean of the small-variance priors for that same loading in the other groups.
Since Bayesian estimation can be computationally challenging, two measures are taken to lower the computation time of Step 1 of MixMG-BSEM: (1) the data are centered per group to remove the mean structure (i.e., the intercepts $\boldsymbol{\tau}_{qg}$ and the factor means are fixed to zero), which is irrelevant to the comparison of structural relations, and (2) MG-BCFA is performed for each factor separately, which is in line with the “measurement blocks” approach in SAM (Rosseel & Loh, 2024) with one factor per measurement block. This approach lowers the number of parameters to be estimated and also enhances the model's robustness against MM misspecifications, such as a few unmodeled crossloadings.
In this paper, we assume that all factor loadings, except for the marker variable loadings, are approximately invariant, while the unique variances and factor variances are estimated as group-specific parameters (i.e., with the default non-informative priors per group). Note that it is harmless to specify exactly invariant loadings as approximately invariant since they will then be estimated as nearly identical across groups. Of course, in practice, combinations of exactly and approximately invariant loadings can be applied in Step 1 of MixMG-BSEM. Moreover, in theory, all combinations of exact invariance, approximate invariance and non-invariance can be used, but complex combinations may cause convergence problems.
To determine which parameters are (approximately) invariant or non-invariant, MI testing should be performed prior to using MixMG-BSEM. Note that, if exact invariance does not hold for a parameter, standard MG-CFA requires a tedious process of comparing group-specific parameter estimates to determine whether differences reflect non-invariance or approximate invariance. Instead, MG-BCFA allows the tenability of AMI to be tested directly by imposing small-variance priors on MM parameters and assessing model fit. Muthén and Asparouhov (2012) recommend starting with a very small variance (e.g., 0.001); if needed, the priors’ variances can be increased until a good model fit is reached. In this way, MG-BCFA provides information on how large the parameter differences are (i.e., on the level of AMI). Model fit can be assessed using the posterior predictive p value (Gelman et al., 1996), but it is not very sensitive to the prior variances in case of large samples. Other fit measures include the Bayesian RMSEA (BRMSEA; Hoofs et al., 2018) and the Deviance Information Criterion (DIC; Spiegelhalter et al., 2002). The DIC balances model fit (i.e., the posterior mean deviance) and complexity (i.e., the effective number of parameters) in Bayesian models, with smaller values indicating a better balance. Regarding the prior selection in MG-BSEM, Kim et al. (2017) found that the DIC often selected models with smaller prior variances when the sample size was small, and Pokropek et al. (2020) found that the DIC performed better as sample size increased and recommended using the DIC with thresholds tailored to different sample sizes.
Once the marker variables and the approximately invariant loadings are confirmed by the MI testing, we obtain the specification of the MG-BCFA model that corresponds to the first step of MixMG-BSEM. In the next step, we need estimates of the factor scores and their uncertainty. To this end, the means and standard deviations of the posterior distributions of the individuals’ factor scores (i.e., estimated latent variable scores) are appended to the data file.
Step 2: Single-Indicator Approach to Obtain Group-Specific Factor Covariances
In a single-indicator approach, the factor scores are used as the “observed” proxy (or a single indicator) for the latent variable (Vermunt, 2025). Since factor scores are only estimates of the true latent variable scores, we apply Croon’s correction (2002) to the factor score covariances ($\mathbf{S}_{g}$) to obtain unbiased estimates of the true latent variable covariances, here denoted as $\tilde{\boldsymbol{\Sigma}}_g$:

$$\tilde{\boldsymbol{\Sigma}}_g = \tilde{\boldsymbol{\Lambda}}_g^{-1} \left( \mathbf{S}_{g} - \tilde{\boldsymbol{\Theta}}_g \right) \tilde{\boldsymbol{\Lambda}}_g^{-1} \tag{2}$$

where $\tilde{\boldsymbol{\Lambda}}_g$ corresponds to the diagonal matrix of group-specific factor loadings (reflecting the reliability of the factor scores) and $\tilde{\boldsymbol{\Theta}}_g$ is the diagonal matrix of group-specific unique variances. These estimates correspond to the MM parameters of the single indicators of the factors (i.e., the factor scores) rather than the original, observed indicators. These MM parameters are derived from the posterior mean and standard deviation estimates for the factor scores, obtained from Step 1. For details, please refer to Equations (7–8) in the MixML-SEM paper (Zhao et al., 2025a).
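Because the loading and unique-variance matrices involved are diagonal, Croon’s correction amounts to an elementwise rescaling of the factor score covariances. The following minimal Python sketch is our own illustration (function and variable names are ours, not the authors' implementation):

```python
def croon_correct(S_fs, lam, theta):
    """Croon's correction for factor-score covariances.

    S_fs : factor-score covariance matrix (list of lists), one row/col per factor
    lam  : diagonal of the loading matrix of the factor scores on the factors
    theta: diagonal of the unique-variance matrix of the factor scores
    Returns the corrected estimate of the true factor covariance matrix.
    """
    Q = len(lam)
    # Subtract the unique variance on the diagonal, then divide by the
    # product of the two loadings involved in each cell.
    return [[(S_fs[q][r] - (theta[q] if q == r else 0.0)) / (lam[q] * lam[r])
             for r in range(Q)] for q in range(Q)]
```

As a sanity check, building a factor-score covariance matrix from a known true covariance and applying the correction recovers the true values exactly.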
Step 3: Structural Model With Mixture Clustering of the Groups
In Step 3, MixMG-BSEM clusters the groups and estimates cluster-specific structural relations. The SM is thus conditional on the cluster membership $z_{gk}$, which denotes whether Group $g$ belongs to Cluster $k$. Whereas the true cluster membership is assumed to be either 1 or 0, its estimate, $\hat{z}_{gk}$, ranges from 0 to 1 and represents the probability of Group $g$ belonging to Cluster $k$. The model-implied factor covariance matrix $\boldsymbol{\Sigma}_{gk}$, given that $z_{gk} = 1$, is defined as:

$$\boldsymbol{\Sigma}_{gk} = \left( \mathbf{I} - \mathbf{B}_k \right)^{-1} \boldsymbol{\Psi}_{gk} \left[ \left( \mathbf{I} - \mathbf{B}_k \right)^{-1} \right]' \tag{3}$$

where $\mathbf{B}_k$ contains the cluster-specific regression coefficients between latent variables, and $\boldsymbol{\Psi}_{gk}$ is the residual factor covariance matrix, which is specified as group-and-cluster-specific to ensure that clustering is driven only by the regressions (for details, see Perez Alonso et al., 2024). The SM is estimated with maximum likelihood estimation using $\tilde{\boldsymbol{\Sigma}}_g$ as input.
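The model-implied covariance computation can be sketched numerically. The code below is our own illustration, assuming the convention that row $i$, column $j$ of the regression matrix holds the effect of factor $j$ on factor $i$, and a recursive SM (strictly lower-triangular regression matrix), so that the inverse reduces to a finite Neumann series:

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def implied_cov(B, Psi):
    """Model-implied factor covariance: (I - B)^-1 Psi [(I - B)^-1]'.

    For a recursive structural model, B is strictly lower-triangular and
    hence nilpotent, so (I - B)^-1 = I + B + B^2 + ... + B^(Q-1) exactly.
    """
    Q = len(B)
    I = [[float(i == j) for j in range(Q)] for i in range(Q)]
    M, P = I, B  # M accumulates the series, P holds the current power of B
    for _ in range(Q - 1):
        M = [[M[i][j] + P[i][j] for j in range(Q)] for i in range(Q)]
        P = mat_mul(P, B)
    Mt = [list(row) for row in zip(*M)]  # transpose of M
    return mat_mul(mat_mul(M, Psi), Mt)
```

For example, with one regression of 0.4 from the first factor onto the second and a residual variance chosen so the second factor's total variance is one, the implied covariance between the two factors equals 0.4.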
For the mixture clustering in MixMG-BSEM, it is assumed that the (true) latent variable scores are sampled from a mixture of multivariate normal distributions. Specifically, all latent variable scores of Group $g$, $\boldsymbol{\eta}_{ig}$ for $i = 1, \ldots, n_g$, are assumed to be sampled from the same distribution:

$$f\left( \boldsymbol{\eta}_{1g}, \ldots, \boldsymbol{\eta}_{n_g g}; \boldsymbol{\Theta} \right) = \sum_{k=1}^{K} \pi_k \prod_{i=1}^{n_g} f_k\left( \boldsymbol{\eta}_{ig}; \boldsymbol{\Theta}_k \right) \tag{4}$$

where $f_k$ is the population density function for Cluster $k$, $\boldsymbol{\Theta}_k$ represents the corresponding set of population parameters, and $\pi_k$ is the prior probability that Group $g$ belongs to Cluster $k$. The scores $\boldsymbol{\eta}_{ig}$ are assumed to follow a normal distribution with $\mathbf{0}$ as the factor mean (which is zero due to centering) and $\boldsymbol{\Sigma}_{gk}$ as the factor covariance matrix. The unknown parameters are estimated by maximizing the following log-likelihood function:

$$\log L = \sum_{g=1}^{G} \log \sum_{k=1}^{K} \pi_k \exp\left\{ -\frac{n_g}{2} \left[ Q \log(2\pi) + \log\left| \boldsymbol{\Sigma}_{gk} \right| + \operatorname{tr}\left( \tilde{\boldsymbol{\Sigma}}_g \boldsymbol{\Sigma}_{gk}^{-1} \right) \right] \right\} \tag{5}$$

where $\tilde{\boldsymbol{\Sigma}}_g$ is the group-specific factor covariance matrix from Step 2 (Equation 2), and $\boldsymbol{\Sigma}_{gk}$ is the group-and-cluster-specific factor covariance matrix from Step 3 (Equation 3). The maximum likelihood estimation is performed using the EM algorithm (Dempster et al., 1977). Specifically, in the E-step, the algorithm estimates the classification probabilities given the current parameter estimates. In the M-step, the algorithm estimates the unknown parameters given the classification probabilities obtained from the E-step. The E- and M-steps are iterated until convergence. A multi-start procedure is applied to mitigate convergence to local maxima, where the converged solution with the highest log-likelihood across the different starts is selected as the final result. For an in-depth explanation of the technical details of Step 3, readers are referred to Appendix A of the paper by Perez Alonso et al. (2024).
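To make the alternating E- and M-steps concrete, the toy sketch below clusters groups into two clusters on a single variance parameter instead of regression coefficients. This is a deliberate simplification of Step 3 for illustration only (all names are ours, not the authors' implementation); the key structural feature is preserved: posteriors are computed per *group*, not per individual, and each group's likelihood aggregates over its $n_g$ observations.

```python
from math import exp, log

def em_two_clusters(s, n, n_iter=50):
    """Toy EM: cluster groups into two clusters on a single variance parameter.

    s : per-group sample variances, n : per-group sample sizes.
    E-step: group-level posterior cluster probabilities z[g][k].
    M-step: mixing proportions pi and cluster-specific variances sig2.
    """
    G = len(s)
    pi = [0.5, 0.5]
    sig2 = [min(s), max(s)]  # simple, deterministic initialization
    z = [[0.5, 0.5] for _ in range(G)]
    for _ in range(n_iter):
        # E-step: z[g][k] proportional to pi_k times the group's likelihood
        # under cluster k (log-sum-exp for numerical stability).
        for g in range(G):
            ll = [log(pi[k]) - 0.5 * n[g] * (log(sig2[k]) + s[g] / sig2[k])
                  for k in range(2)]
            m = max(ll)
            w = [exp(v - m) for v in ll]
            z[g] = [wk / sum(w) for wk in w]
        # M-step: update mixing proportions and cluster variances given z.
        for k in range(2):
            pi[k] = sum(z[g][k] for g in range(G)) / G
            sig2[k] = (sum(z[g][k] * n[g] * s[g] for g in range(G))
                       / sum(z[g][k] * n[g] for g in range(G)))
    labels = [0 if z[g][0] >= z[g][1] else 1 for g in range(G)]
    return labels, sig2
```

With two well-separated sets of group variances, the sketch recovers the partition and the cluster-specific parameters; a real implementation would add a convergence criterion and multiple random starts, as described above.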
Simulation
In the simulation study, we evaluated the performance of MixMG-BSEM, assuming the true number of clusters was known. Firstly, we aimed to examine how MixMG-BSEM’s performance was affected by the within-group sample size, the number of groups, the number of clusters, the item reliability, the AMI of the loadings, and the size of (differences in) regression parameters. On top of that, since the first step of MixMG-BSEM estimates the MM per factor, we evaluated the consequences of ignoring crossloadings in this step. Literature on traditional SEM has shown that factor correlations tend to be overestimated when crossloadings are constrained to zero (e.g., Asparouhov et al., 2015; Marsh et al., 2009, 2010, 2014), which may affect the comparison of structural relations. Given its stepwise estimation and measurement block approach, MixMG-BSEM may be relatively robust to overlooked crossloadings (Rosseel & Loh, 2024), but the recovery of clusters and regression parameters may still decline in case of multiple crossloadings. Secondly, in terms of the analysis, we examined a key aspect of the Bayesian estimation: the impact of different prior variances for the loadings on the recovery of clusters and cluster-specific regressions. We expected that using overly narrow priors might fail to capture the loading differences across groups, which may affect the estimation of and clustering on the structural relations. Additionally, we evaluated which prior was selected by the DIC, since selecting this prior is an important step in empirical practice.
In a complete factorial design, the following factors were manipulated:
Total number of groups $G$ (2 levels): 24, 48.
Within-group sample size $n_g$ (3 levels): 50, 100, 200.
Number of clusters $K$ (2 levels): 2, 4.
Size of regression parameters $\beta$ (3 levels): 0.2, 0.3, 0.4.
Item reliability (2 levels): high, low.
Level of AMI for loadings (5 levels): 0.001, 0.005, 0.01, 0.05, 0.1.
Size of crossloadings (3 levels): 0, 0.2, 0.4.
We chose a minimum of 24 groups with group sizes ranging from 50 to 200, which partially corresponds to the group sizes in other simulation studies on Bayesian AMI (Kim et al., 2017; Lek et al., 2018). The number of groups in each cluster depended on the number of groups $G$ and the number of clusters $K$, where each cluster contained an equal number of groups. Note that a larger $G$, a larger $n_g$, and a smaller $K$ result in larger within-cluster sample sizes, which were expected to improve the performance of MixMG-BSEM.
The data were generated from a SEM model with four latent variables, each measured by five items (see Figure 2), as in Perez Alonso et al. (2024) and Zhao et al. (2025a). Specifically, the data were generated from a multivariate normal distribution (MVN) with covariance matrix $\boldsymbol{\Sigma}_g$, determined by the parameters $\boldsymbol{\Lambda}_g$, $\boldsymbol{\Theta}_g$, and $\boldsymbol{\Sigma}_{gk}$ (see Equation (6) in Perez Alonso et al., 2024).
Figure 2
The Data-Generating Model With Exogenous Factors F1 and F2 and Endogenous Factors F3 and F4
The size of the regression parameters was set to $\beta$ and, as shown in Figure 3, the differences between clusters were introduced by setting one regression parameter to zero in each cluster. Hence, larger values of $\beta$ resulted in larger differences and thus in greater separation between clusters, which should make the clusters easier to recover.
Figure 3
The Cluster-Specific Structural Relations
For the group-and-cluster-specific residual factor covariances $\boldsymbol{\Psi}_{gk}$, we sampled the variances of the exogenous factors $F_1$ and $F_2$ from a uniform distribution, as well as their covariance. The total variances of the endogenous factors $F_3$ and $F_4$ were sampled from the same uniform distribution, and their residual variances were then determined by subtracting the variance explained by the cluster-specific regressions from the respective total variance.
In Loading Matrix $\boldsymbol{\Lambda}_g$, the first loading of each factor was fixed to one. The other loadings (except for crossloadings) were approximately invariant across groups and were sampled from a normal distribution, with a higher mean in case of a high item reliability and a lower mean in case of a low item reliability, and with a variance that depended on the level of the AMI. According to a summary by Arts et al. (2021), prior variances used in simulation and empirical studies typically range from 0.001 to 0.1. Following this, our simulations included five levels of AMI (i.e., 0.001, 0.005, 0.01, 0.05, 0.1). For instance, to obtain an AMI level of 0.01, which implies a variance of 0.01 for differences in loadings, we sampled loadings from a normal distribution with a variance of 0.005 for all groups.2 Per factor, one crossloading was added to the third item measuring the next factor (i.e., Item 8 crossloaded on Factor 1, Item 13 on Factor 2, Item 18 on Factor 3, and Item 3 on Factor 4). A value of 0 corresponded to no crossloading, 0.2 to a moderate crossloading, and 0.4 to a large crossloading. The unique variances on the diagonal of $\boldsymbol{\Theta}_g$ were sampled from a uniform distribution with smaller values under high reliability than under low reliability.
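The halved sampling variance follows from the fact that the difference of two independent draws has twice the variance of a single draw: $\mathrm{Var}(\lambda_g - \lambda_{g'}) = 2\,\mathrm{Var}(\lambda)$. The following quick numerical check is our own illustration (the mean of 0.8 is an arbitrary placeholder, not a value from the study):

```python
import random
from statistics import variance

random.seed(1)
ami = 0.01            # target variance of loading *differences* across groups
draw_var = ami / 2    # variance used when sampling each group's loading
lam = [random.gauss(0.8, draw_var ** 0.5) for _ in range(20000)]

# Empirical variance of differences between loadings of independent group pairs
diffs = [lam[i] - lam[i + 10000] for i in range(10000)]
print(round(variance(diffs), 3))  # close to the target AMI level of 0.01
```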
Finally, the data were sampled from the multivariate normal distribution for each group. In total, we generated 2 (number of groups) × 3 (within-group sample size) × 2 (number of clusters) × 3 (size of regression parameters) × 2 (reliability) × 5 (size of AMI) × 3 (size of crossloadings) × 50 (replications) = 54,000 data sets according to the described procedure, using R Version 4.4 (R Core Team, 2022). All data sets were analyzed with MixMG-BSEM with 50 random starts and the true number of clusters. For each data set, we performed the analysis five times, with different prior variances for the loadings (i.e., 0.001, 0.005, 0.01, 0.05, 0.1) in Step 1, to examine the performance of MixMG-BSEM across different prior variances. The analyses were performed on a supercomputer consisting of 2 Intel Xeon Platinum 8468 CPUs (Sapphire Rapids). The average computation time was 50.5 minutes for Step 1 (mainly influenced by $n_g$ and $G$), with the four factors estimated sequentially, 0.03 minutes for the intermediate Step 2, and 5.7 minutes for Step 3 (mainly influenced by $G$, $n_g$, and $K$).3 We assessed MCMC convergence of the Bayesian estimation using the potential scale reduction factor (PSRF). On average, the PSRF was 1.000, indicating sufficient convergence, with no notable differences across conditions. These results suggest that the MCMC chains converged properly for all simulated data sets.
Results
Recovery of Factor Loadings
We evaluated the recovery of the group-specific factor loading estimates for each item $j$, using the Root Mean Squared Error (RMSE) across groups as follows:

$$RMSE_{\lambda_j} = \sqrt{\frac{1}{G} \sum_{g=1}^{G} \left( \hat{\lambda}_{jg} - \lambda_{jg} \right)^2} \tag{6}$$

where $\lambda_{jg}$ is the true group-specific loading of the $j$-th item on the factor, and $\hat{\lambda}_{jg}$ is the corresponding estimate. Note that $RMSE_{\lambda_j}$ is also affected by the variance of the estimates, rather than just the bias.
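As a minimal illustration, the RMSE across groups for one item's loading can be computed as follows (our own helper, not the authors' code):

```python
from math import sqrt

def rmse_loading(estimates, true_values):
    """RMSE of an item's group-specific loading estimates across G groups."""
    G = len(true_values)
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_values)) / G)
```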
When using MixMG-BSEM with the true prior variances for the loadings, the average $RMSE_{\lambda_j}$ across the four factors and all simulated data sets was 0.066, 0.091, 0.066, and 0.066, respectively, for the loadings of the second to the fifth item of each factor (Table 1, last row). Note that $RMSE_{\lambda_3}$ was larger due to the disregarded crossloadings on that item. This was also the only value that differed across the four factors. Specifically, the $RMSE_{\lambda_3}$ values were 0.093, 0.072, 0.098, and 0.101, for $F_1$ to $F_4$, respectively. It seems that the third loading for $F_2$ is affected by the ignored crossloading the least, which may be explained by the fact that, unlike the other factors, $F_2$ is involved in only one direct regression relation with the other factors4 and is thus less correlated with the other factors. When the crossloadings were zero (i.e., without crossloadings), $RMSE_{\lambda_3}$ took on the same values as for the other loadings (0.066), whereas it increased with larger crossloadings (see Table 1). $RMSE_{\lambda_3}$ was also higher in case of a larger $\beta$, which implies stronger correlations between factors (see Table 1). Note that a larger $n_g$, higher item reliability, and lower levels of AMI (and thus lower prior variances) resulted in a lower $RMSE_{\lambda_j}$ for all items. The latter is explained by the fact that a lower prior variance more strongly approximates an equality constraint, which lowers the sample size requirements and thus the estimates’ variability for a given sample size.
Table 1
The Average $RMSE_{\lambda}$ for Factor Loading Estimates When Using the True Prior Variances for the Loadings
| Factor | Level | $RMSE_{\lambda_2}$ (SD) | $RMSE_{\lambda_3}$ (SD) | $RMSE_{\lambda_4}$ (SD) | $RMSE_{\lambda_5}$ (SD) |
|---|---|---|---|---|---|
| $G$ | 24 | 0.066 (0.031) | 0.091 (0.041) | 0.066 (0.031) | 0.066 (0.031) |
| | 48 | 0.065 (0.031) | 0.090 (0.041) | 0.065 (0.031) | 0.065 (0.031) |
| $n_g$ | 50 | 0.079 (0.038) | 0.102 (0.046) | 0.079 (0.038) | 0.079 (0.038) |
| | 100 | 0.065 (0.028) | 0.091 (0.039) | 0.065 (0.028) | 0.065 (0.028) |
| | 200 | 0.052 (0.019) | 0.081 (0.035) | 0.052 (0.019) | 0.052 (0.019) |
| $K$ | 2 | 0.066 (0.031) | 0.091 (0.041) | 0.066 (0.031) | 0.066 (0.031) |
| | 4 | 0.066 (0.031) | 0.091 (0.041) | 0.066 (0.031) | 0.066 (0.031) |
| $\beta$ | 0.2 | 0.066 (0.031) | 0.080 (0.035) | 0.066 (0.031) | 0.066 (0.031) |
| | 0.3 | 0.066 (0.031) | 0.090 (0.039) | 0.066 (0.031) | 0.066 (0.031) |
| | 0.4 | 0.065 (0.031) | 0.103 (0.045) | 0.065 (0.031) | 0.065 (0.031) |
| Reliability | high | 0.061 (0.028) | 0.087 (0.039) | 0.061 (0.027) | 0.061 (0.027) |
| | low | 0.070 (0.034) | 0.095 (0.043) | 0.070 (0.034) | 0.070 (0.034) |
| AMI | 0.001 | 0.028 (0.006) | 0.055 (0.029) | 0.028 (0.006) | 0.028 (0.006) |
| | 0.005 | 0.048 (0.006) | 0.072 (0.027) | 0.048 (0.006) | 0.048 (0.006) |
| | 0.01 | 0.060 (0.009) | 0.084 (0.027) | 0.060 (0.009) | 0.060 (0.009) |
| | 0.05 | 0.090 (0.020) | 0.116 (0.032) | 0.090 (0.020) | 0.090 (0.020) |
| | 0.1 | 0.102 (0.027) | 0.128 (0.037) | 0.102 (0.026) | 0.102 (0.027) |
| Crossloadings | 0 | 0.066 (0.031) | 0.066 (0.031) | 0.066 (0.031) | 0.066 (0.031) |
| | 0.2 | 0.066 (0.031) | 0.086 (0.032) | 0.066 (0.031) | 0.065 (0.031) |
| | 0.4 | 0.066 (0.031) | 0.122 (0.037) | 0.066 (0.031) | 0.066 (0.031) |
| Total | | 0.066 (0.031) | 0.091 (0.041) | 0.066 (0.031) | 0.066 (0.031) |
To illustrate the effect of the prior variances for the loadings, the average $RMSE_{\lambda}$ across different prior variances is shown in Figure 4. The diagonal of the plot represents cases where the prior variances were correctly specified, while the lower part shows cases where the priors were narrower than the true level of AMI. In general, applying too narrow or too wide priors resulted in larger $RMSE_{\lambda}$ values. Since the prior variance affected the loading recovery, we also evaluated prior selection using the DIC, which balances model fit and complexity. When looking at the prior selection per loading, the correct selection rate was 59.2% across all loadings and all simulated data sets.5 For 28.2% of the data sets, the prior selection was flawless in the sense that the true priors were selected for all loadings. Generally, the DIC tended to select either the true or slightly smaller prior variances. Specifically, for an AMI level of 0.001, the DIC correctly selected the prior variance of 0.001 for 86.6% of the loadings, with smaller proportions selecting 0.005 (12.5%) and 0.01 (0.9%). For an AMI level of 0.005, the correct selection rate was 59.4%, followed by 0.001 (36.6%) and 0.01 (4.0%). For an AMI level of 0.01, the DIC most frequently selected 0.005 (71.1%), followed by 0.01 (23.5%) and 0.001 (5.4%). For an AMI level of 0.05, the DIC primarily selected the true prior variance (81.5%), followed by prior variances of 0.1 (10.0%) and 0.01 (8.5%). For an AMI level of 0.1, the DIC mostly selected prior variances of 0.05 (55.1%) and 0.1 (44.9%). Although prior selection based on the DIC varied across different levels of AMI and was not always optimal, the resulting $RMSE_{\lambda}$ values remained relatively stable given the selected priors, suggesting stable factor loading recovery despite prior variance misspecification.
Figure 4
The Average $RMSE_{\lambda}$ Across Different Prior Variances, Indicated by the Columns, Whereas the Rows Represent the True Levels of AMI
Note. The diagonal (in white) contains cases where the prior variances were correctly specified, while the lower part represents cases where the priors were too narrow. For each row, the cells are colored red if the $RMSE_{\lambda}$ is larger than the $RMSE_{\lambda}$ on the diagonal, and blue if it is smaller.
Sensitivity to Local Maxima
To evaluate how often (Step 3 of) MixMG-BSEM converged to a local maximum, we compared the log-likelihood of the final best solution (out of 50 random starts) to the one obtained when starting from the true clustering, which is a proxy for the global maximum. If the log-likelihood of the best solution was more than 0.001 lower than that of the proxy, the solution was considered a local maximum. Overall, when applying the true priors, MixMG-BSEM ended up in a local maximum for 0.01% of the data sets, with all local maxima occurring in a specific combination of the most challenging conditions.
Recovery of Clusters
The Adjusted Rand Index (ARI; Hubert & Arabie, 1985) measures the similarity between two partitions while correcting for chance, with a value of one indicating perfect agreement and zero indicating the level of agreement between two random partitions. To compute the ARI, the modal clustering (i.e., assigning each group to the cluster with the highest classification probability) was compared to the true clustering. Additionally, the correct clustering rate (%CC) was computed as the percentage of correctly clustered groups.
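For reference, the ARI can be computed from the contingency table of the two partitions. The following is our own compact sketch of the Hubert and Arabie (1985) formula, not code from the study:

```python
from collections import Counter

def comb2(n):
    """Number of unordered pairs among n objects."""
    return n * (n - 1) // 2

def adjusted_rand_index(part_a, part_b):
    """ARI between two partitions given as label lists (labels arbitrary)."""
    n = len(part_a)
    cells = Counter(zip(part_a, part_b))   # contingency-table cells
    rows = Counter(part_a)                 # row (partition A) marginals
    cols = Counter(part_b)                 # column (partition B) marginals
    index = sum(comb2(v) for v in cells.values())
    sum_rows = sum(comb2(v) for v in rows.values())
    sum_cols = sum(comb2(v) for v in cols.values())
    expected = sum_rows * sum_cols / comb2(n)   # chance-level agreement
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)
```

Because the ARI depends only on the grouping structure, relabeling the clusters (e.g., swapping labels 0 and 1) leaves it unchanged, which is why no label matching is needed before computing it.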
When using the true priors, the average ARI across all simulated data sets was 0.644 and the %CC was 84.8%.6 To check which main effects and two-way interactions among the manipulated factors significantly influenced the ARI, we conducted an analysis of variance (ANOVA) using the aov function in R. The ANOVA results table is provided in the Supplementary Material (S1; Zhao et al., 2025b). Firstly, all main effects were significant. From Table 2, we see that a larger $G$, larger $n_g$, smaller $K$, larger $\beta$, and higher reliability all led to better recovery of clusters, with $n_g$ and $\beta$ having the strongest effects, as reflected in higher ARI and %CC. The differences in ARI across levels of AMI and crossloadings were very small, suggesting that cluster recovery was relatively unaffected by these factors. Secondly, $n_g$, $K$, $\beta$, and reliability all interacted significantly with one another (i.e., all six of their two-way interactions were significant), which is why we further illustrate the four-way interaction effect of these four manipulated factors on the ARI in Figure 5. We see that the ARI was very sensitive to $n_g$, and more so in case of $K = 4$, a smaller $\beta$, or low reliability. According to Steinley (2004), ARI values above 0.80 indicate good cluster recovery. Using this rule-of-thumb, the cluster recovery was good for the larger values of $\beta$ combined with a sufficiently large $n_g$. When $\beta = 0.2$, the ARI only exceeded 0.80 with $n_g = 200$ for two clusters and a high reliability.
Table 2
The Average ARI and Correct Clustering Rate (%CC) When Using the True Prior Variances for the Loadings
| Factor | Level | ARI (SD) | %CC (SD) |
|---|---|---|---|
| $G$ | 24 | 0.628 (0.331) | 0.840 (0.170) |
| | 48 | 0.660 (0.301) | 0.856 (0.159) |
| $n_g$ | 50 | 0.402 (0.280) | 0.733 (0.174) |
| | 100 | 0.663 (0.284) | 0.861 (0.145) |
| | 200 | 0.865 (0.187) | 0.949 (0.082) |
| $K$ | 2 | 0.719 (0.295) | 0.913 (0.106) |
| | 4 | 0.569 (0.320) | 0.782 (0.187) |
| $\beta$ | 0.2 | 0.400 (0.282) | 0.733 (0.175) |
| | 0.3 | 0.685 (0.280) | 0.871 (0.142) |
| | 0.4 | 0.849 (0.201) | 0.940 (0.095) |
| Reliability | high | 0.682 (0.309) | 0.865 (0.159) |
| | low | 0.606 (0.320) | 0.830 (0.170) |
| AMI | 0.001 | 0.651 (0.314) | 0.851 (0.163) |
| | 0.005 | 0.647 (0.317) | 0.849 (0.164) |
| | 0.01 | 0.649 (0.316) | 0.850 (0.164) |
| | 0.05 | 0.638 (0.319) | 0.845 (0.167) |
| | 0.1 | 0.635 (0.318) | 0.843 (0.166) |
| Crossloadings | 0 | 0.654 (0.316) | 0.852 (0.164) |
| | 0.2 | 0.644 (0.318) | 0.848 (0.166) |
| | 0.4 | 0.634 (0.318) | 0.844 (0.166) |
| Total | | 0.644 (0.317) | 0.848 (0.165) |
Figure 5
The ARI for MixMG-BSEM as a Function of the Within-Group Sample Size, Number of Clusters, Size of the Regression Parameters, and Reliability
Across the different prior variances, the ARI remained relatively stable. When the applied prior was too narrow or too wide, the ARI dropped slightly. For example, for an AMI level of 0.1, it decreased from 0.635 when using the true prior variance to 0.608 when using a prior variance of 0.001. For an AMI level of 0.001, it decreased from 0.651 when using the true prior variance to 0.638 when using a prior variance of 0.1.
Recovery of Regression Parameters
To evaluate the recovery of the regression parameters, we computed the bias and the root mean squared error (RMSE) per regression parameter, averaged across the clusters:

$$\text{bias}(\hat{\beta}) = \frac{1}{K}\sum_{k=1}^{K}\left(\hat{\beta}_{k} - \beta_{k}\right), \qquad \text{RMSE}(\hat{\beta}) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\left(\hat{\beta}_{k} - \beta_{k}\right)^{2}},$$

where $\hat{\beta}_{k}$ is the estimated regression coefficient in cluster $k$ and $\beta_{k}$ is the corresponding true value. To get a better idea of the separate coefficients in the different clusters (i.e., without averaging across clusters), we also reported the bias for one of the regression parameters as an example in the Supplementary Material (S2; Zhao et al., 2025b), with positive values indicating overestimation in most conditions. Similar trends were observed for two of the remaining parameters; for the fourth parameter, the bias values were the smallest and close to zero (slightly negative).
On average, the bias was 0.037, -0.004, 0.039, and 0.031 (Table 3), and the RMSE was 0.073, 0.054, 0.072, and 0.069 (Table 4) for the four regression parameters, respectively.7 We performed an ANOVA on the bias as an example (S1). All main effects were significant, except for AMI. Similar to the trends observed for the cluster recovery, more groups, larger within-group sample sizes, fewer clusters, larger regression parameters, higher reliability, and smaller crossloadings resulted in a smaller bias, with the crossloadings having the strongest effect (Table 3). To get a better idea of how the bias was influenced by the manipulated factors, we depicted the interaction effects of the within-group sample size, number of clusters, and size of the regression parameters across the different levels of crossloadings and reliability (Figure 6). Specifically, larger crossloadings resulted in larger bias values, especially in case of a lower item reliability. Note that the recovery of the regression parameters was barely affected by using different prior variances, even more narrow ones, likely because the cluster recovery was hardly affected either.
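For concreteness, the two recovery measures can be computed as in the following sketch (the helper names and toy values are ours, not the study's actual estimates):

```python
from math import sqrt

def bias(estimates, true_values):
    """Mean deviation of cluster-specific estimates from their true values."""
    return sum(e - t for e, t in zip(estimates, true_values)) / len(estimates)

def rmse(estimates, true_values):
    """Root mean squared error of the estimates across clusters."""
    return sqrt(sum((e - t) ** 2 for e, t in zip(estimates, true_values)) / len(estimates))

# Toy example: estimated coefficients in two clusters vs. their true values.
est, true = [0.25, 0.35], [0.20, 0.30]
print(round(bias(est, true), 3))  # 0.05
print(round(rmse(est, true), 3))  # 0.05
```

Note that the bias can cancel across clusters when over- and underestimation alternate, whereas the RMSE, being based on squared deviations, cannot; this is why both measures are reported.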
Table 3
The Average Bias for Each of the Four Estimated Regression Parameters When Using the True Prior Variances for the Loadings
| Factor | Level | Parameter 1 (SD) | Parameter 2 (SD) | Parameter 3 (SD) | Parameter 4 (SD) |
|---|---|---|---|---|---|
| Number of groups | 24 | 0.037 (0.057) | -0.004 (0.045) | 0.039 (0.053) | 0.032 (0.051) |
| | 48 | 0.036 (0.046) | -0.004 (0.033) | 0.039 (0.043) | 0.031 (0.041) |
| Within-group sample size | 50 | 0.040 (0.071) | -0.002 (0.059) | 0.040 (0.063) | 0.034 (0.063) |
| | 100 | 0.035 (0.044) | -0.005 (0.031) | 0.038 (0.043) | 0.030 (0.040) |
| | 200 | 0.034 (0.033) | -0.006 (0.018) | 0.038 (0.036) | 0.029 (0.031) |
| Number of clusters | 2 | 0.035 (0.046) | -0.005 (0.036) | 0.038 (0.042) | 0.031 (0.039) |
| | 4 | 0.038 (0.056) | -0.004 (0.043) | 0.040 (0.054) | 0.032 (0.053) |
| Size of regression parameters | 0.2 | 0.040 (0.059) | -0.002 (0.048) | 0.041 (0.055) | 0.036 (0.052) |
| | 0.3 | 0.037 (0.050) | -0.004 (0.038) | 0.039 (0.047) | 0.032 (0.046) |
| | 0.4 | 0.033 (0.044) | -0.007 (0.031) | 0.037 (0.042) | 0.026 (0.041) |
| Reliability | high | 0.033 (0.045) | -0.005 (0.035) | 0.035 (0.043) | 0.028 (0.042) |
| | low | 0.040 (0.058) | -0.004 (0.044) | 0.042 (0.053) | 0.034 (0.051) |
| AMI | 0.001 | 0.037 (0.050) | -0.004 (0.039) | 0.040 (0.048) | 0.032 (0.046) |
| | 0.005 | 0.036 (0.050) | -0.004 (0.039) | 0.040 (0.048) | 0.033 (0.046) |
| | 0.01 | 0.037 (0.051) | -0.004 (0.039) | 0.039 (0.048) | 0.032 (0.046) |
| | 0.05 | 0.037 (0.053) | -0.005 (0.041) | 0.039 (0.050) | 0.030 (0.047) |
| | 0.1 | 0.036 (0.054) | -0.005 (0.041) | 0.037 (0.048) | 0.029 (0.048) |
| Crossloadings | 0 | 0.001 (0.043) | 0.001 (0.040) | -0.000 (0.036) | 0.002 (0.040) |
| | 0.2 | 0.040 (0.043) | -0.004 (0.040) | 0.042 (0.037) | 0.033 (0.040) |
| | 0.4 | 0.069 (0.045) | -0.010 (0.039) | 0.075 (0.039) | 0.059 (0.041) |
| Total | | 0.037 (0.052) | -0.004 (0.040) | 0.039 (0.048) | 0.031 (0.047) |
Table 4
The Average RMSE for Each of the Four Estimated Regression Parameters When Using the True Prior Variances for the Loadings
| Factor | Level | Parameter 1 (SD) | Parameter 2 (SD) | Parameter 3 (SD) | Parameter 4 (SD) |
|---|---|---|---|---|---|
| Number of groups | 24 | 0.081 (0.065) | 0.062 (0.057) | 0.079 (0.058) | 0.077 (0.061) |
| | 48 | 0.064 (0.051) | 0.045 (0.044) | 0.065 (0.046) | 0.061 (0.048) |
| Within-group sample size | 50 | 0.105 (0.079) | 0.087 (0.069) | 0.100 (0.070) | 0.101 (0.075) |
| | 100 | 0.064 (0.042) | 0.047 (0.035) | 0.065 (0.039) | 0.061 (0.039) |
| | 200 | 0.048 (0.027) | 0.027 (0.017) | 0.051 (0.029) | 0.045 (0.024) |
| Number of clusters | 2 | 0.055 (0.039) | 0.036 (0.032) | 0.056 (0.037) | 0.052 (0.035) |
| | 4 | 0.090 (0.070) | 0.071 (0.061) | 0.088 (0.061) | 0.087 (0.066) |
| Size of regression parameters | 0.2 | 0.086 (0.067) | 0.067 (0.061) | 0.085 (0.062) | 0.083 (0.062) |
| | 0.3 | 0.070 (0.057) | 0.051 (0.050) | 0.070 (0.051) | 0.067 (0.054) |
| | 0.4 | 0.061 (0.049) | 0.043 (0.038) | 0.060 (0.041) | 0.058 (0.048) |
| Reliability | high | 0.064 (0.048) | 0.048 (0.045) | 0.065 (0.047) | 0.062 (0.048) |
| | low | 0.081 (0.068) | 0.059 (0.057) | 0.079 (0.058) | 0.076 (0.062) |
| AMI | 0.001 | 0.072 (0.055) | 0.052 (0.050) | 0.071 (0.050) | 0.068 (0.053) |
| | 0.005 | 0.072 (0.056) | 0.053 (0.051) | 0.071 (0.052) | 0.069 (0.054) |
| | 0.01 | 0.072 (0.057) | 0.053 (0.051) | 0.071 (0.051) | 0.069 (0.054) |
| | 0.05 | 0.074 (0.062) | 0.055 (0.052) | 0.073 (0.057) | 0.070 (0.057) |
| | 0.1 | 0.074 (0.066) | 0.055 (0.054) | 0.072 (0.054) | 0.070 (0.061) |
| Crossloadings | 0 | 0.055 (0.058) | 0.052 (0.052) | 0.051 (0.052) | 0.056 (0.055) |
| | 0.2 | 0.070 (0.058) | 0.053 (0.053) | 0.069 (0.050) | 0.067 (0.055) |
| | 0.4 | 0.092 (0.056) | 0.055 (0.050) | 0.096 (0.048) | 0.085 (0.053) |
| Total | | 0.073 (0.059) | 0.054 (0.052) | 0.072 (0.053) | 0.069 (0.056) |
Figure 6
The Bias for MixMG-BSEM as a Function of the Within-Group Sample Size, Number of Clusters, Size of the Regression Parameters, Crossloadings, and Reliability
Section Conclusion
We assessed the performance of MixMG-BSEM when the true number of clusters was known. We found that performing 50 random starts in Step 3 largely prevented local maxima. The recovery of the clusters and regression parameters was good when the within-group sample size was at least 200 and/or in case of a larger cluster separation (i.e., larger between-cluster differences in the regression parameters). Ignoring larger crossloadings (by estimating the MM per factor) resulted in more biased estimates of the factor loadings and regression parameters, whereas the cluster recovery was less affected. Although prior selection based on the DIC was not always optimal, the recovery of the clusters and regression parameters was relatively stable across the priors.
Discussion
We presented MixMG-BSEM as a new addition to the novel mixture SEM framework for comparing structural relations across many groups. Unlike the existing approaches that rely on the exact MI assumption, MixMG-BSEM adopts the more realistic assumption of AMI, which accommodates small differences in MM parameters across groups. Specifically, after estimating the MM using MG-BCFA with small-variance priors, MixMG-BSEM clusters groups with the same structural relations, thereby eliminating the need for pairwise comparisons of group-specific structural relations. Since its results may depend on the specified prior variances, results obtained with different prior variances should be compared and the best prior selected based on model fit measures such as the Deviance Information Criterion (DIC). Our simulation study results showed that both the clustering and regression parameter estimates were relatively insensitive to the choice of prior variances, however.
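The DIC trades off posterior model fit against effective model complexity. As a minimal sketch (the function name is ours, and the deviance values are hypothetical), it can be computed from MCMC output as the posterior mean deviance plus the effective number of parameters, following Spiegelhalter et al. (2002):

```python
from statistics import mean

def dic(deviance_draws, deviance_at_posterior_mean):
    """DIC = mean posterior deviance + p_D, where the effective number of
    parameters p_D is the mean deviance minus the deviance evaluated at the
    posterior mean of the parameters."""
    d_bar = mean(deviance_draws)              # posterior mean deviance
    p_d = d_bar - deviance_at_posterior_mean  # effective number of parameters
    return d_bar + p_d

# Toy example: deviance at three posterior draws, and at the posterior mean.
print(dic([10.0, 12.0, 14.0], 9.0))  # 15.0
```

In this setting, one would fit the model under each candidate prior variance and retain the prior yielding the lowest DIC, since a lower DIC indicates a better fit-complexity trade-off.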
Currently, MixMG-BSEM estimates the MM per factor (i.e., with one factor per measurement block). In the simulation study, the cluster recovery was unaffected by ignoring crossloadings, but the recovery of the factor loadings and regression estimates was affected. Therefore, it would be valuable to investigate the performance of MixMG-BSEM when including factors with crossloadings in the same measurement block, at the cost of a (much) longer computation time, where small-variance priors could also be applied to the crossloadings to allow for small differences (Muthén & Asparouhov, 2012). However, it is important to note that the default prior mean for crossloadings is zero, whereas applying a prior mean of zero to a sizeable crossloading can negatively impact the regression parameter estimates (Wei et al., 2022). Therefore, researchers should gather prior information about crossloadings before choosing an appropriate prior (Wei et al., 2022). Note that, in cases with several crossloadings connecting more than two factors, the estimation may fail when all these factors are included in one measurement block, especially in case of many groups. Thus, in such cases, model estimation is only possible when partitioning the factors into smaller measurement blocks, highlighting the benefits of the measurement block approach even more.
While the simulation study evaluated the performance of MixMG-BSEM with approximate metric invariance for all loadings except the invariant marker variable loading, MixMG-BSEM can theoretically accommodate any combination of exact invariance, approximate invariance, and non-invariance for the loadings. The stepwise estimation of MixMG-BSEM conveniently allows researchers to tweak the MG-BCFA model, for instance, by specifying certain loadings as non-invariant, before moving on to the next steps. Similarly, if group-specific loading estimates are virtually identical across groups, one may consider specifying the loading as exactly invariant. Specifying an invariant parameter as approximately invariant is rather harmless, whereas specifying a non-invariant parameter as approximately invariant may bias the parameter estimates and affect the clustering. Note that MG-BCFA allows evaluating non-invariance for all parameters, by comparing group-specific estimates to the credible intervals of the average posterior estimates across all groups (e.g., Winter & Depaoli, 2020). In future research, it would be interesting to evaluate the performance of MixMG-BSEM when non-invariant loadings are specified as approximately invariant.
The simulation study assumed the number of clusters to be known, whereas this is typically unknown for empirical data. To determine the number of clusters, different methods are available, such as the Bayesian Information Criterion (BIC; Schwarz, 1978), Akaike Information Criterion (AIC; Akaike, 1974), and convex hull procedure (CHull; Ceulemans & Kiers, 2006). In brief, all these methods balance model fit (i.e., the log-likelihood) and model complexity (i.e., the number of parameters). BIC and AIC do so by combining model fit and a penalty for model complexity into a single criterion, whereas CHull uses a generalized scree test. Previous studies on model selection for MixMG-SEM (Perez Alonso et al., 2025) and MixML-SEM (Zhao et al., 2025a) have shown that combining AIC, BIC, and CHull — with visual inspection of the scree plot — is an effective way to determine the number of clusters. Since MixMG-BSEM performs the same mixture clustering on group-specific factor covariances as these methods, we expect these recommendations to generalize to MixMG-BSEM. However, in the future, it would still be useful to evaluate model selection for MixMG-BSEM specifically.
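To make these model selection criteria concrete, the following sketch (our own helper names; the log-likelihoods and parameter counts are hypothetical) computes AIC and BIC, plus the CHull scree-test value that quantifies the elbow at an intermediate model:

```python
from math import log

def aic(loglik, n_params):
    """Akaike Information Criterion: -2 log-likelihood + 2 * #parameters."""
    return -2 * loglik + 2 * n_params

def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 log-likelihood + log(N) * #parameters."""
    return -2 * loglik + log(n_obs) * n_params

def chull_scree_ratio(fits, complexities, i):
    """CHull scree-test value for model i: the fit gain per added parameter
    before model i divided by the fit gain per added parameter after it.
    Higher values indicate a more pronounced elbow at model i."""
    gain_before = (fits[i] - fits[i - 1]) / (complexities[i] - complexities[i - 1])
    gain_after = (fits[i + 1] - fits[i]) / (complexities[i + 1] - complexities[i])
    return gain_before / gain_after

# Hypothetical log-likelihoods and parameter counts for 1-4 clusters:
ll = [-1500.0, -1400.0, -1390.0, -1385.0]
pars = [20, 30, 40, 50]
print([round(aic(l, p), 1) for l, p in zip(ll, pars)])  # [3040.0, 2860.0, 2860.0, 2870.0]
print(chull_scree_ratio(ll, pars, 1))                   # elbow at 2 clusters: 10.0
```

Here both the scree ratio and the flattening AIC values would point toward two clusters, illustrating why combining the criteria with visual inspection of the scree plot is recommended.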
Currently, MixMG-BSEM combines Bayesian and maximum likelihood estimation, assuming continuous items. In empirical practice, we often work with ordinal items with a few response categories (e.g., Likert scale items). To accommodate ordinal data in MixMG-BSEM, only the first step (i.e., MG-BCFA) would need to be adjusted to deal with ordinal data (Muthén & Asparouhov, 2013a), whereas the subsequent steps would remain unchanged. In future studies, it will be valuable to evaluate the performance of MixMG-BSEM adapted to ordinal data.
In conclusion, MixMG-BSEM is an effective method for accommodating AMI while clustering structural relations of interest. By relaxing the strict assumption of exact MI, it extends the framework of novel mixture SEM methods in an important way, making it more suited for empirical applications where small differences in parameters across groups are expected.