Latent Profile Analysis of Human Values: What is the Optimal Number of Clusters?

Latent Profile Analysis (LPA) is a method to extract homogeneous clusters characterized by a common response profile. Previous works employing LPA to human value segmentation tend to select a small number of moderately homogeneous clusters based on model selection criteria such as Akaike information criterion, Bayesian information criterion and Entropy. The question is whether a small number of clusters is all that can be gleaned from the data. While some studies have carefully compared different statistical model selection criteria, there is currently no established criteria to assess if an increased number of clusters generates meaningful theoretical insights. This article examines the content and meaningfulness of the clusters extracted using two algorithms: Variational Bayesian LPA and Maximum Likelihood LPA. For both methods, our results point towards eight as the optimal number of clusters for characterizing distinctive Schwartz value typologies that generate meaningful insights and predict several external variables.

Latent Class Analysis (LCA; McCutcheon, 1987) is a method to classify respondents into latent types based on their response patterns on a set of variables. In contrast to con ventional heuristic techniques such as k-means (Bock, 2007) and hierarchical clustering (Ward, 1963), LCA is based on a finite latent variable mixture model (Berlin, Williams, & Parra, 2014;Eid, Langeheine, & Diener, 2003;Lanza, Flaherty, & Collins, 2003;McLachlan, Lee, & Rathnayake, 2019;Petersen, Qualter, & Humphrey, 2019). In the case where the observed variables are continuous, LCA is referred to as Latent Profile Analysis (LPA). A general rule in latent variable mixture modelling is that the larger the number of clusters, the more homogeneous the respective clusters will be. However, many of the works employing LCA (Davidov, Yang-Hansen, Gustafsson, Schmidt, & Bamberg, 2006;Eid et al., 2003;Moors & Vermunt, 2007;Reinecke & Seddig, 2011;Rudnev, Magun, & Schmidt, 2014) determined that the appropriate number of clusters is in the range of three to five on average, based on model selection criterion such as the Akaike information criterion (AIC), the Bayesian information criterion (BIC), and Entropy. Although these criteria are generally considered as reliable, they do not always agree, and researchers tend to settle on a small number of moderately homogeneous clusters. A question that remains to be answered is whether this small number of clusters is the only solution that can be gleaned from the data, or whether a larger number of theoretically and empirically meaningful clusters is concealed within the data that may advance our understanding of the pattern of clusters in populations.
The purpose of LPA is to reduce the data to a reasonable number of classes, mean ingful both in terms of statistical saliency and theoretical interpretability, and in our work we define the term "optimal number of clusters" as a number that is both statisti cally justifiable and substantively meaningful. Nylund, Asparouhov, and Muthén (2007) conducted a thorough study to compare existing model selection criteria for LCA and demonstrated that the Bootstrap Likelihood Ratio Test (BLRT; McLachlan & Peel, 2000) was the best performing criterion to determine an optimal number of clusters of the mixture models including the Maximum Likelihood LCA (ML-LCA). However, Nylund et al. (2007) also pointed out that BLRT is not commonly implemented in mixture mod eling software, generally requires substantial computation time and sometimes leads to incorrect p-value estimations. Nylund-Gibson and Choi (2018) further demonstrated that the BLRT and the Vuong-Lo-Mendell-Rubin (VLMR) likelihood ratio test (Lo, Mendell, & Rubin, 2001) perform well in determining an optimal number of clusters. In addition, Spurk, Hirschi, Wang, Valero, and Kauffeld (2020) reviewed the best practice for selecting an optimal number of clusters extracted by LPA. Spurk et al. (2020) also highlighted the importance of assessing whether increasing the number of clusters lead to new typologi cal profiles that generate meaningful insights (Berlin et al., 2014;Vermunt & Magidson, 2002). Finally, Albers, Mørup, Schmidt, and Glückstad (2020) recently demonstrated that "the Bayesian LCA can be used in a predictive approach to model order selection, in which an appropriate number of clusters is identified by optimizing the predictive per formance on held-out data" (p. 4). These previous works indicate that the determination of an optimal number of clusters is still a challenging issue, and that approaches suitable for determining a number of cluster of the mixture models are underexplored.
LCA and LPA are often used by psychologists and sociologists to classify individuals into specific personality types (Glückstad, Schmidt, & Mørup, 2017;Magun, Rudnev, & Schmidt, 2016;Moors & Vermunt, 2007;Petersen et al., 2019;Spurk et al., 2020). Some of these works (Magun et al., 2016;Rudnev et al., 2014) used the Portrait Values Ques tionnaire (PVQ-21) of the Schwartz theory of ten basic human values (Schwartz, 2006(Schwartz, , 2012 to conduct a typological analysis. For example, Rudnev et al. (2014) implemented an extended version of LPA that adds a "method factor" loaded on the 21 PVQ items  and investigated the invariance of the Schwartz value typologies across the 4th-, 5th-, and 6th rounds of the European Social Survey (ESS). Similarly, Magun et al. (2016) analyzed 29 European countries from the 4th round of the ESS using a factor mixture model (Muthén, 2008). Both analyses used AIC, BIC, Entropy and VLMR as model selection criteria and identified five clusters characterized with typologies representing only part of the abstract value orientations of the Schwartz ten value factors (Schwartz, 1992;Schwartz et al., 2012).
The majority of these previous works use ML-LPA that estimates "conditional means and variances of the continuous indicators" (Nylund-Gibson & Choi, 2018, p. 454). In this work we first consider a so-called, variational Bayesian approach to statistical inference in the LPA model (VB-LPA) that optimizes a criterion called the Evidence Lower Bound (ELBO), a lower bound on the marginal likelihood. This is of particular interest because the ELBO itself is a statistically consistent criterion for model order selection. Second, we implement several cluster evaluation metrics that can be used to assess quality of clusters extracted from a wide range of clustering algorithms, not only the aforementioned LCA/LPA but also other heuristic models such as k-means. For comparing qualities of the clusters extracted by VB-LPA and ML-LPA, we demonstrate how to identify an optimal number of clusters extracted by the two approaches. To demonstrate the usability of the VB-LPA and the cluster evaluation metrics, we statistically evaluate typologies of the European population based on the 21 PVQ items (Schwartz, 2012) included in the 8th round of the European Social Survey (ESS8) dataset (Jowell, Roberts, Fitzgerald, & Eva, 2007). Accordingly, the next section reviews the Schwartz theory of ten basic human values, followed by the brief description of data, the methods applied to the current work, the analyses, and the results. We finalize with a discussion of our findings. Schwartz (1992Schwartz ( , 2006 proposed a theory summarising "ten motivationally distinct value orientations" organised as a circular structure (see Figure 1). Each value has a unique motivation underlying it. Closer values in the value circle are expected to share simi lar motivations, and opposing values are expected to have contradictory motivations.

Schwartz Theory of Ten Basic Human Values
The values can be summarized into four higher-order dimensions: self-transcendence consisting of "universalism" and "benevolence" that opposes self-enhancement consisting of "power" and "achievement"; conservation consisting of "tradition", "conformity", and "security" that opposes openness to change consisting of "stimulation" and "self-direc tion"; and "hedonism" lies between openness to change and self-enhancement. As shown in Figure 1, these four higher-order values are further classified into the dimensional dichotomies of (i) social-focus vs. personal-focus; and (ii) growth vs. self-protection.

Method Data
In this study we analyzed the secondary data compiled from the 8th round of the ESS (Jowell et al., 2007). From the original sample consisting of 44,387 respondents, 3,094 respondents with incomplete responses to the PVQ items were excluded. The PVQ21 consists of 21 questions responded to on a 6-point Likert scale. To adjust for individual response styles, previous studies (Rudnev et al., 2014) employed procedures such as centering (Smith & Schwartz, 1997) and a "method factor" (Magun et al., 2016;Vermunt, 2010). For the demonstration of the VB-LPA and the cluster evaluation metrics, the current study used the Confirmatory Factor Analysis (CFA) to compute factor scores representing the Schwartz ten basic values for the respective respondents (N = 41,293) from the raw Likert scores of the 21 PVQ questions, following the instructions given in Schwartz (2014). The factor scores for the ten values provided the raw input for the LPA (ML-LPA and VB-LPA). Syntax, fit summary and parameter estimates of the CFA are found in the Supplementary Materials (Appendices 4, 5, and 6).

Variational Bayesian Latent Profile Analysis (VB-LPA)
Assuming that the dataset is organized in a way that the observations (i.e., respondents of the ESS8) are the rows and the indicators (i.e., Schwartz' value indicators) are the columns of a matrix containing the CFA scores for each respondent, VB-LPA aims to group the rows to form homogeneous blocks. VB-LPA is specified according to the following generative model: Probability of belonging to each cluster.
λ km~G amma a, b Precision (inverse variance) of each cluster profile.
A nm~N ormal μ z n m , λ z n m −1 Response for each respondent and each indicator.
where n, m, and k refer to each respondent, question, and latent profile respectively. The model posits that there exists K profiles, each characterized by a mean value and a precision for each indicator, and that each respondent belongs to one of these clusters with some prior probability. Bayesian inference in the model involves estimating the posterior distribution of all the model parameters, conditioned on the observed data matrix A. Since exact Bayesian inference is intractable in this model, the Bayesian posterior distributions must be approximated. Let all model parameters be denoted by θ = π k k ∈ 1…K , z n n ∈ 1…N , μ km , λ km k ∈ 1…K, m ∈ 1…M . (6) In the variational Bayesian (VB) approach to approximate Bayesian inference, we assume that the posterior can be well represented by a factorized probability distribution To fit the model, we optimize the ELBO given by where p A θ and p θ are the likelihood and prior as defined by the generative model. This corresponds to minimizing the Kullback-Leibler divergence between the true poste rior and the variational approximation. The ELBO is optimized iteratively using closed form fixed-point updates for each term in the factorization, while keeping all other terms fixed.
Benefits of VB include the ability to explicitly monitor convergence as well as ease of interpretation accounting for uncertainty according to the inferred factorized distribu tion, although the assumed factorized distribution is only an (optimistic) approximation of the uncertainty and prone to local minima issues in the optimization.

ML-LPA
Using the Mplus software implementation (Muthén & Shedden, 1999;Muthén & Muthén, 2017), we carried out the ML-LPA for solutions in the range from K= 3 to K= 20 with the default settings of equal variances across classes and zero covariance between the indicators for each class. To mitigate the potential local maxima issues, Mplus recom mends that "the best log-likelihood value should be replicated in at least two final-stage solutions" (Spurk et al., 2020); to ensure this, we used 500 random restarts for 3 ≤ K ≤ 7, 1 500 restarts for 8 ≤ K ≤ 17, and 2 000 restarts for 18 ≤ K ≤ 20. We computed fit per formance scores including AIC, BIC and Entropy, and we further computed VLMR- (Lo et al., 2001) and BLRT scores (McLachlan & Peel, 2000) that calculate p-values indicating whether the fit of K classes significantly increases from K − 1 classes. The syntax for the Mplus analysis is provided in the Supplementary Materials (Appendix 7).

VB-LPA
Using our own software implementation (software implementation of the VB-LPA is available in Python in the Supplementary Materials, folder VB_PLA) we carried out the VB-LPA for models in the range from K= 3 to K= 20 with hyperparameters fixed at α = a = b = λ 0 = 1. The iterative fixed point update of the parameters were run until the relative improvement of the ELBO of a full cycle of updates of all parameters was less than 10 −7 or until completion of 1,000 full update cycles. To mitigate effects of local sub-optimal solutions, we repeated the procedure 100 times using different random initi alizations and selected the solution with the highest ELBO as optimal. We verified that in all cases, the best solution was replicated to confirm that 100 repeats was adequate.

Cluster Evaluation Metrics
To evaluate the extracted clusters, we consider several generic metrics.

Maximum Profile Correlation
When the number of clusters is low, we expect the extracted profiles to be clearly distinct, but as the number of clusters increases, we expect that the extracted profiles will gradually become more similar. To assess this, we examined the maximum Pearson correlation coefficient between cluster profiles: The correlation coefficient will be equal to 1 when two cluster profiles are identical up to scaling and translation.

Within-and Between-Cluster Variance
In general, a clustering solution is good when the observations in each cluster are very similar while the clusters are very dissimilar. To assess this we examined the average within-cluster variance S = 1 K ∑ k=1 K S k , S k = 1 N k ∑ n z n =k a n − μ k 2 , and the variance between the cluster profiles

Davies-Bouldin Score
The Davies-Bouldin (DB) score is a measure of the trade-off between the within and between cluster variance, defined as the average similarity of each cluster to its closest neighbor where the similarity measure is defined as the ratio of within and between cluster spread

Entropy
Entropy, as defined in Mplus, is a statistic used to measure the uncertainty of the assignment of observations to cluster. More precisely Entropy is an inverted relative measure of entropy defined as where the normalization log K is the maximum entropy obtained for a uniform distribu tion, and H z is the information entropy of the categorical distribution, Here, p k denotes the probability that the observation z is assigned to cluster k. The Entropy is in the range 0 to 1 and attains its maximum when there is no uncertainty in the cluster assignment.

External Validation
External variables, that are expected to correlate with the value-based cluster assign ment, can be used to validate the clustering and aid in determining the most appropriate number of clusters. To measure the degree of statistical relation between the cluster assignment and external categorical variables we used adjusted normalized mutual infor mation (AMI) defined as where p x is the fraction of observations that fall in category x in the label X , and p xy is the fraction of observations that fall in categories x and y in labels X and Y respectively. Mutual information measures the degree of statistical association between random variables and quantifies the amount of information (measured in nats) that the variables have in common. The AMI adjusts the mutual information for association due to chance, and scales the measure so that AMI = 1 corresponds to perfect agreement and AMI = 0 is the expected agreement due to chance when there is no association. Since mutual infor mation takes the uncertainty related to the estimated cluster assignments into account, we find it well suited to measure the statistical relation between cluster assignments and external variables. Table 1 shows the fit summary of the best log-likelihood solutions replicated at least twice. Both AIC and BIC scores continuously decrease, indicating that more than 20 clus ters is appropriate. The entropy scores increase to K= 16 where it levels off. According to the VLMR test, the first non-significant increase in model order is at K= 9, indicating that a model with K = 8 should be examined further. The BLRT tests required substantial computational time, which all resulted in P = 0.0000. In addition, for the range between K= 18 and K= 20, some (out of five) bootstrap draws had both a smaller LRT value than the observed LRT value and not a replicated best log-likelihood value.

VB-LPA Fit Performance
The model fit in terms of the highest attained ELBO is shown in Figure 2. The ELBO in creases continuously with a higher number of clusters, indicating that there is statistical support in the data for more than K = 20 clusters.

Profile Correlation
The maximum Pearson correlation between cluster profiles is shown in Figure 3A. For ML-LPA at K= 7 and for VB-LPA at K= 6, the maximum correlation steeply increases from less than 0.7 to more than 0.9, indicating that above this threshold the methods discover two latent profiles which are close to identical up to a scaling and translation, which is also evident in the plot of the profiles in Figure 4.

Within-and Between-Cluster Variance
The variance within and between clusters is shown in Figure 3B. As the number of clusters increases, the clusters become more dispersed (higher between-cluster variance) and contain fewer, more similar observations (lower within-cluster variance) as could be expected. The between-cluster variance is significantly lower for the VB-LPA, presum ably because of the regularizing effect of the Bayesian estimation. The curves do not exhibit any conspicuous notches or plateaus which warrant further examination.

Davies-Bouldin Score
The minima of the DB score shown in Figure 3C indicate model orders that achieve the best trade-off between within-and between-cluster variance. For both models, the best DB score is attained at K= 8. For the ML-LPA, model orders K= 4, 9, 20 and 5 are close contenders, but for the VB-LPA, K= 8 is optimal with a large margin to the second-best solution at K= 7.

Entropy
The certainty with which the observations are assigned to their respective clusters, as measured by the Entropy, is shown in Figure 3D. The Entropy curves for the two methods are similar, with Entropy for VB-LPA consistently around a percentage point higher than ML-LPA. Models with low Entropy should be rejected because the assign ment of observations to clusters should be unequivocal in order to interpret the latent factors. ML-LPA and VB-LPA reach an Entropy score above 0.9 around K= 9 and K= 6, respectively. Both methods yield a large increase in Entropy for K= 6.

Value Profiles
The results of cluster evaluation indicate that K= 8 seems to be one of the optimal numbers of clusters for the two LPA techniques, although the Entropy score of K= 8 for ML-LPA did not reach the level of 0.9. Figure 4 illustrates the fitted model profiles (i.e., value typologies) for K = 3…8 (specific Schwartz typologies are provided in the Supplementary Materials, Appendix 1). To aid the visualization, the profiles are color matched by greedy minimization of squared error, which enables us to inspect what type of clusters emerged when the number of clusters increased from K= 3 to K= 8. Generally, the typologies of the clusters shown are substantially similar between ML-LPA and VB-LPA (see "Overlap between ML-LPA8 and VB-LPA8" in Supplementary Materials, Appendix 1). At the starting point, K= 3 consists of three clusters characterized as (i) positive to all values but especially positive to achievement; (ii) positive only to social-focus; and (iii) negative to all values but especially negative to openness to change.
When the number of clusters increased to K= 5, the pattern of the second cluster was slightly modified to (iii) negative to all values but especially negative to power, and two  . When the number of cluster further increased to K= 7, the second cluster (ii) only positive to social-focus split into two: (ii) moderately positive only to social-focus and (vi) strongly positive only to social-focus. In addition, a new cluster (vii) moderately positive only to personal-focus was introduced. Finally, when K= 8 was reached, the fifth cluster, (v) positive only to growth split into (v) strongly positive only to growth and (viii) moderately positive only to growth. Importantly, although K= 3 was dominated by two clusters characterized as (i) positive to all values and (iii) negative to all values, the increased number of clusters reduced the sizes of these two clusters, and subsequently revealed additional value typologies meaningfully related to the Schwartz circular value structure. Finally, Figure 5 depicts histograms of the cluster sizes for K = 3…20. Clusters ob tained using VB-LPA are considerably more uniform in size; for example, for K ≥ 14, the smallest cluster obtained using ML-LPA covers 1% or less of the observations, whereas the smallest cluster in VB-LPA never covers less than 2% of the observations.

External Validation
Plots of AMI between cluster assignments and nine external variables are shown in Figure 6. In all cases, AMI increases from the K= 3 solution and most either peak or level out around K= 5-10. Two variables (b, h) have a global maximum at K= 8-9 for both methods. In the ML-LPA for several variables (a, b, c, e, f, i), AMI exhibits a local peak at K= 5, but this effect is absent in the VB-LPA. In the VB-LPA for several variables (b, c, e, f, g, h, i), AMI levels off at K= 8-9.

Discussion and Conclusion
This article presented the VB-LPA algorithm and several statistical criteria inspired by Spurk et al. (2020) which can be used to evaluate content of the clusters and select an optimal number of clusters for various clustering techniques. Our work first reviewed the performance fit of the models obtained from the two LPA techniques. One of the criteria suggested by Nylund et al. (2007), i.e., the VLMR likelihood ratio test (Lo et al., 2001), selected K= 8 as an optimal number of clusters. On the other hand, widely used criteria such as the AIC the BIC and the Entropy for the ML-LPA as well as the ELBO criterion for the VB-LPA were not able to identify a specific optimal number within the range of K = 3…20. The criteria evaluating the maximum profile correlation, cluster variance and entropy indicated that, when the number of clusters is between 3 and 5, typological structures were not clearly identified. This may be due to the characteristics of the particular dataset that is rather homogeneous. However, when the number of clusters reached the level of K = 6…8, contents of the clusters started to stabilize. The AMI analysis further suggested that when the number of clusters reached the level of K = 8…9, the external variables measuring attitudes and subjective opinions started to level out. On the other hand, the AMI scores for the demographic variables leveled out when the number of cluster reached at K = 5. The qualitative evaluation of the value profiles demonstrated that, for instance, the all positive or the all negative clusters identified in K= 3 contain potentially meaningful clusters such as strong growth, moderate growth, strong personal-focus, moderate person al-focus that appeared when the number of clusters increased to K= 8. The AMI and the qualitative inspection summarized in Supplementary Materials (Appendix 3) confirmed that these additional profiles identified in K= 8 provided additional meaningful insights. Our findings therefore encourage researchers to evaluate if additional profiles obtained from the increased number of clusters provide meaningful insights in relation with the

AMI Between Clustering and External Variables
Note. AMI = Adjusted normalized mutual information. external variables researchers intend to investigate, before they decide on an optimal number of clusters, as Spurk et al. (2020) suggested. The AMI criterion can be especially useful in this respect for evaluating the abilities of the clustering solutions to predict external variables when researchers decide on an optimal number of clusters obtained from not only LPA or LCA but also from other clustering techniques such as k-means and hierarchical clustering techniques.
As for the comparison between the ML-LPA and the VB-LPA (summary of the com parisons is found in the Supplementary Materials, Appendix 3), our study identified several differences in the two algorithms. First, the sizes of clusters identified by the ML LPA were varied, while the clusters identified by the VB-LPA were regulated. Second, the ML-LPA achieved the higher between-cluster variance scores and the lower Davies-Boul din scores compared to the VB-LPA. This indicates that the regulated size distribution achieved in VB-LPA might have influenced the scores of the lower between-cluster variance. Third, the entropy scores for the VB-LPA were generally higher than those for the ML-LPA. These differences might be explained by the assumption of equal variances between classes in the ML-LPA, where the VB-LPA assumes unequal variances. Finally, in terms of the computational time, the ML-LPA required the random starts up to 2 000 times for the higher number of clusters, while the VB-LPA required only 100 random starts for the range of K = 3…20.
A limitation of our study is that the comparisons of the ML-LPA and VB-LPA were not based on a simulation study but only on empirical data instead. The comparisons using the PVQ items based on the cluster evaluation metrics, however, enabled us to identify an optimal number of clusters that are statistically justifiable and substantively meaningful for interpreting the Schwartz theory. Another noteworthy point is that the ML-LPA based on the method factor presented in Magun et al. (2016), Vermunt (2010) as well as a test run of ML-LPA using the centering (Smith & Schwartz, 1997) scores of the PVQ items both resulted in a smaller number of clusters as an optimal. On the other hand, our study using the CFA scores identified K = 8 as an optimal number of clusters. The different optimal numbers of clusters identified from different data-sets may be due to the fact that the method factor and the centering techniques are usually used to eliminate the response biases, whereas the CFA scores inherently distinguished several Schwartz typologies with either positive or negative response styles. Some recent works (See, Klimstra, van den Bergh, Sijtsema, & Denissen, 2021) consider hierarchical tree typology that distinguishes high-and low response styles of pathological personality traits. Hence, it is debatable whether the response styles should be eliminated within the clustering process. This clearly indicates that there is room for further investigation on various data pre-processing procedures. Finally, our study did not consider a covariance structure among the indicators, which can represent the underlying dimensionality of the Schwartz values in addition to classification. Thus, future research comparing ML