
The analysis of change within subjects over time is an increasingly important research topic. Besides modelling the individual trajectories, a related aim is to identify clusters of subjects with similar trajectories. Various methods for analyzing such longitudinal trajectories have been proposed. In this paper we investigate the performance of three different methods under various conditions in a Monte Carlo study. The first method is based on the non-parametric k-means algorithm, the second is a latent class mixture model, and the third is based on the analysis of change indices. All three methods are available in R. Results show that the k-means method performs consistently well in recovering the known clustering structure. The mixture model method performs reasonably well, whereas the change indices method has problems with smaller data sets.

The analysis of change in individuals and the development over time of groups of individuals is important in many research fields. With the emergence, or rather the increased popularity, of intensive longitudinal designs such as the Experience Sampling Method (ESM) (

Longitudinal data may be derived from experimental or observational studies. In longitudinal experiments, differential between-subjects effects over time are usually the primary focus, and the level or shape of the trajectory is often secondary. In observational studies the level or shape of the trajectory is often central to the analyses, and comparisons between pre-existing groups (e.g. men versus women, educational level) are of secondary importance. Another type of question concerns differences between groups that were not defined beforehand but are derived from the data (

To analyze change, a construct must be measured repeatedly, which yields a set of trajectories, one for each subject. Unlike more general multivariate data, the repeated measures in longitudinal data are dependent because of their ordering in time, so traditional regression techniques, which assume independent observations, cannot be applied. Longitudinal data of the kind collected with methods popular in the social sciences, such as ESM, also differ from time series data: instead of a few random processes sampled uniformly over time, the more general longitudinal data consist of a large number of independent trajectories that are potentially sampled irregularly over time. Although it is possible to aggregate the individual change over time into a mean change over time, one can also analyze the differential growth trajectories between individuals in order to, for example, identify subgroups for which an intervention is successful.

With such longitudinal data the research questions concern within-subjects change, such as: what is the individual and general level of change in the construct of interest, and what pattern or shape does this change have? Questions can also concern between-subjects differences in change, such as: do some (groups of) individuals have a different level or pattern of change? The patterns of change can be captured in a functional form (parametric), usually a linear or quadratic pattern, but non-parametric descriptions of the trajectories are also possible. Once questions about the level and patterns of change are answered, the next level of interest lies in predicting these change levels or patterns from covariates.

One way to analyze growth trajectories is to cluster them into partitions that reflect different trajectories of growth within a population (

Given the increased focus on intensive longitudinal designs, it is to be expected that researchers used to standard statistical software packages will shift their attention to more flexible platforms for such data analysis, such as the statistical programming language R (

In this article we will focus on three popular methods for longitudinal cluster analysis that are available in R (Version 4.1.0), reflecting different methods for clustering longitudinal data. These methods are: (1) the

Generally, statistical models that yield as outcomes an overall fitted line and a distribution of fitted lines in longitudinal data are called latent growth models (LGM) in the context of Structural Equation Modeling (SEM). A fundamental concept in SEM is the modelling of factors that are

Another variant of this type of model is the Growth Mixture Model (GMM), where the term mixture refers to it being a mixture of (latent) growth models. GMM is a framework within the multilevel modeling (MLM) literature, which approaches growth through the separation of variance into fixed and random effects. In longitudinal data a fixed effect assumes that the model intercept is time-invariant, and a random effect allows for testing whether the intercept is likely not time-invariant, i.e. suggests growth. Despite the different approaches to longitudinal data between LGM and GMM, from a comparison of the basic equations underlying both models, it can be seen that these two types of models are in their basic form essentially the same, see e.g.

The latent class growth model (LCGM) and GMM are closely related, where LCGM is a special type of GMM (e.g.

The first method to explore is the

The

The second non-parametric method, implemented in the R-package

The third method is the

In this study the function ‘hlme’ from the

We have given a very brief overview of three packages for longitudinal cluster analysis in R. There is little knowledge about how these three methods perform relative to each other and which method might be preferred by researchers interested in longitudinal clustering of their data. As such the aim of this study is to compare the quality of clustering solutions between these three methods in R, in order to identify the strengths and weaknesses of each method and help practitioners in making an informed choice among these methods.

In order to compare the methods our research question is: do the three methods, namely

This study is not the first to compare different longitudinal clustering methods. For instance,

In order to test the different longitudinal clustering methods data were simulated using a Monte Carlo method. The data generating procedure was to first define the number of clusters (

More specifically, these are the trajectories of the six clusters

• stable low:

• linear growth:

• quadratic decline and increase:

• stable high:

• linear decline:

• quadratic increase and decline:

Where

The Monte Carlo simulation subsequently varied the data sets on a fixed number of repeated measurements or time points (

The total number of cells in the simulated design was 2 (clusters) x 2 (time points) x 2 (subjects) x 2 (levels of error) = 16. For each method a new series of Monte Carlo simulated data sets was constructed, bringing the total number of cells in the design, when also accounting for the clustering methods, to 48. The number of replicated data sets per cell was
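The data-generating procedure described above can be sketched as follows (in Python for illustration, since the indices and generation scheme are language-agnostic): each subject receives its cluster's mean curve evaluated at T time points plus i.i.d. Gaussian measurement error. The mean curves and coefficients below are assumptions made for this sketch, not the exact values used in the study; only the first three trajectory shapes are shown, as in the three-cluster condition.

```python
import random

# Assumed cluster mean curves (illustrative only; the paper's exact
# coefficients are not reproduced here).
MEAN_CURVES = {
    "stable low": lambda t: 1.0,
    "linear growth": lambda t: 1.0 + 0.5 * t,
    "quadratic decline and increase": lambda t: 3.0 - 1.5 * t + 0.2 * t ** 2,
}

def simulate(n_subjects_per_cluster, time_points, error_sd, seed=1):
    """Return (trajectories, labels): one noisy trajectory per subject,
    built as the cluster mean curve plus Gaussian measurement error."""
    rng = random.Random(seed)
    trajectories, labels = [], []
    for label, curve in MEAN_CURVES.items():
        for _ in range(n_subjects_per_cluster):
            trajectories.append(
                [curve(t) + rng.gauss(0.0, error_sd) for t in range(time_points)]
            )
            labels.append(label)
    return trajectories, labels

# e.g. the smallest design cell: 3 clusters, T = 5, S = 25, error SD = 0.5
trajs, labels = simulate(n_subjects_per_cluster=25, time_points=5, error_sd=0.5)
```

Each Monte Carlo replication would repeat this with a fresh seed and feed the resulting trajectories to the three clustering methods.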

In

The Rand index (

Here,
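The pair-counting logic behind the Rand index and its chance-corrected variant, the adjusted Rand index (ARI), can be sketched as follows. This uses the standard Hubert-Arabie definitions (shown in Python for illustration): count, over all pairs of subjects, how often the two partitions agree on placing the pair together or apart.

```python
from itertools import combinations

def rand_indices(labels_true, labels_pred):
    """Rand index and adjusted Rand index from pairwise (dis)agreements.

    RI = (a + d) / (a + b + c + d), where a pairs are together in both
    partitions and d are apart in both; the ARI rescales RI so that its
    expected value under random labelling is 0 and its maximum is 1.
    """
    a = b = c = d = 0
    for (t1, p1), (t2, p2) in combinations(zip(labels_true, labels_pred), 2):
        same_true, same_pred = t1 == t2, p1 == p2
        if same_true and same_pred:
            a += 1
        elif same_true:
            b += 1          # together in the true partition only
        elif same_pred:
            c += 1          # together in the predicted partition only
        else:
            d += 1
    n_pairs = a + b + c + d
    ri = (a + d) / n_pairs
    expected_a = (a + b) * (a + c) / n_pairs     # chance-expected agreement
    max_a = 0.5 * ((a + b) + (a + c))
    ari = (a - expected_a) / (max_a - expected_a)
    return ri, ari
```

Note that the ARI is invariant to relabelling of the clusters: a perfect recovery with permuted cluster labels still yields an ARI of 1.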

The Calinski-Harabasz index (

For a given number of clusters,

where,

Higher values of

Here, the factor

where
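Using its standard definition, the Calinski-Harabasz index is the ratio of between-cluster to within-cluster dispersion, scaled by (n - k)/(k - 1). A minimal sketch in Python, treating each trajectory as a point in T-dimensional space (the 0-1 standardization applied in this study is not reproduced here):

```python
def calinski_harabasz(points, labels):
    """Standard Calinski-Harabasz index: (BSS / (k - 1)) / (WSS / (n - k)),
    treating each trajectory as a point whose coordinates are its repeated
    measurements."""
    n, dim = len(points), len(points[0])
    overall = [sum(p[j] for p in points) / n for j in range(dim)]
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    bss = wss = 0.0
    for members in clusters.values():
        centroid = [sum(p[j] for p in members) / len(members) for j in range(dim)]
        # between-cluster dispersion: size-weighted distance to overall mean
        bss += len(members) * sum((centroid[j] - overall[j]) ** 2 for j in range(dim))
        # within-cluster dispersion: distances to the cluster centroid
        for p in members:
            wss += sum((p[j] - centroid[j]) ** 2 for j in range(dim))
    return (bss / (k - 1)) / (wss / (n - k))
```

Well-separated, tight clusters drive the within-cluster term toward zero and the index upward, which is why higher values indicate a better-defined partition.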

A fourth measure to compare the methods on is the variability of the

The results of the Monte Carlo study for the four indices are presented in the tables below.

First, the recovery of the true clustering using the

| Condition | Method | Error = 0.5, Clusters = 3 | Error = 0.5, Clusters = 6 | Error = 1.0, Clusters = 3 | Error = 1.0, Clusters = 6 |
| --- | --- | --- | --- | --- | --- |
| T = 5, S = 25 | kml | 0.929 | 0.893 | 0.521 | 0.575 |
| | traj | 0.483 | 0.540 | 0.220 | 0.272 |
| | lcmm | 0.574 | 0.852 | 0.416 | 0.572 |
| T = 5, S = 50 | kml | 0.939 | 0.908 | 0.568 | 0.612 |
| | traj | 0.490 | 0.547 | 0.237 | 0.267 |
| | lcmm | 0.901 | 0.886 | 0.473 | 0.613 |
| T = 10, S = 25 | kml | 1.000 | 1.000 | 0.960 | 0.958 |
| | traj | 0.678 | 0.891 | 0.228 | 0.444 |
| | lcmm | 0.962 | 0.916 | 0.946 | 0.897 |
| T = 10, S = 50 | kml | 1.000 | 1.000 | 0.970 | 0.971 |
| | traj | 0.684 | 0.907 | 0.243 | 0.439 |
| | lcmm | 0.978 | 0.986 | 0.970 | 0.946 |

The

With respect to the number of clusters, in particular the

The number of time points has a consistent effect on the

Next, the

| Condition | Method | Error = 0.5, Clusters = 3 | Error = 0.5, Clusters = 6 | Error = 1.0, Clusters = 3 | Error = 1.0, Clusters = 6 |
| --- | --- | --- | --- | --- | --- |
| T = 5, S = 25 | kml | 101.540 | 272.537 | 32.774 | 79.731 |
| | traj | 49.804 | 63.464 | 13.176 | 30.711 |
| | lcmm | 69.031 | 249.348 | 26.306 | 77.025 |
| T = 5, S = 50 | kml | 203.608 | 546.643 | 61.941 | 155.336 |
| | traj | 97.800 | 126.975 | 26.648 | 59.197 |
| | lcmm | 192.796 | 519.823 | 52.909 | 150.584 |
| T = 10, S = 25 | kml | 152.579 | 252.779 | 39.510 | 64.078 |
| | traj | 76.515 | 171.461 | 9.131 | 28.938 |
| | lcmm | 144.360 | 202.191 | 38.413 | 59.935 |
| T = 10, S = 50 | kml | 304.657 | 503.114 | 77.968 | 127.321 |
| | traj | 158.189 | 356.767 | 17.899 | 56.819 |
| | lcmm | 294.111 | 487.246 | 77.279 | 123.306 |

The results of the

The results for the standardized Calinski-Harabasz Index for the three methods are given in

| Condition | Method | Error = 0.5, Clusters = 3 | Error = 0.5, Clusters = 6 | Error = 1.0, Clusters = 3 | Error = 1.0, Clusters = 6 |
| --- | --- | --- | --- | --- | --- |
| T = 5, S = 25 | kml | 0.321 | 0.595 | 0.087 | 0.175 |
| | traj | 0.161 | 0.159 | 0.035 | 0.068 |
| | lcmm | 0.220 | 0.550 | 0.070 | 0.169 |
| T = 5, S = 50 | kml | 0.322 | 0.598 | 0.083 | 0.171 |
| | traj | 0.159 | 0.159 | 0.036 | 0.065 |
| | lcmm | 0.305 | 0.573 | 0.071 | 0.165 |
| T = 10, S = 25 | kml | 0.215 | 0.319 | 0.048 | 0.073 |
| | traj | 0.109 | 0.220 | 0.011 | 0.033 |
| | lcmm | 0.204 | 0.256 | 0.047 | 0.068 |
| T = 10, S = 50 | kml | 0.215 | 0.318 | 0.048 | 0.072 |
| | traj | 0.112 | 0.228 | 0.011 | 0.032 |
| | lcmm | 0.207 | 0.308 | 0.047 | 0.070 |

The results for the

The results for the standard deviation of the ARI values across replications for the three methods are given in

| Condition | Method | Error = 0.5, Clusters = 3 | Error = 0.5, Clusters = 6 | Error = 1.0, Clusters = 3 | Error = 1.0, Clusters = 6 |
| --- | --- | --- | --- | --- | --- |
| T = 5, S = 25 | kml | 0.058 | 0.072 | 0.117 | 0.057 |
| | traj | 0.170 | 0.071 | 0.126 | 0.090 |
| | lcmm | 0.380 | 0.108 | 0.191 | 0.086 |
| T = 5, S = 50 | kml | 0.039 | 0.060 | 0.086 | 0.049 |
| | traj | 0.142 | 0.065 | 0.087 | 0.090 |
| | lcmm | 0.133 | 0.080 | 0.178 | 0.057 |
| T = 10, S = 25 | kml | 0.002 | 0.000 | 0.045 | 0.049 |
| | traj | 0.327 | 0.079 | 0.167 | 0.071 |
| | lcmm | 0.135 | 0.135 | 0.103 | 0.128 |
| T = 10, S = 50 | kml | 0.000 | 0.000 | 0.027 | 0.026 |
| | traj | 0.354 | 0.058 | 0.152 | 0.049 |
| | lcmm | 0.094 | 0.066 | 0.041 | 0.077 |

The variability across the replications is smallest for the

This study compared three different methods for longitudinal cluster analysis and focused on three corresponding R-packages that are available on the R-repository CRAN. Studies have compared model-based longitudinal clustering methods (

Our findings imply that longitudinal

In our Monte Carlo approach, the number of clusters to be found was set equal to the number of clusters that were generated. Our study can be regarded as a comparison between methods in a best-case scenario: when the number of clusters in the analyses match the number of clusters in the population. The limitation of our approach is that it is unclear how well the methods and packages perform when used to explore an unknown number of clusters. It would be worthwhile to evaluate the performance of these methods and packages when used to recover the number of clusters. This has already been addressed in other studies (

In a similar vein, the sample sizes in our Monte Carlo simulation were small (

The Calinski-Harabasz index as used in this study is not optimal for comparing methods. The CH-index’s strength lies in finding the optimal number of clusters; as such it is more useful for within-dataset comparisons than for between-method comparisons across different data sets. We still opted to include the CH-index because researchers applying clustering methods to their data use it, and we wanted to explore how it performs, even in the best-case scenario with the true number of clusters. To allow for between-method comparisons we standardized the CH-index between 0 and 1. In future studies it would be worthwhile to explore the performance of the different methods when the CH-index is used to choose the number of clusters. In this study the CH-index leads to conclusions that are congruent with the other measures of fit, adding support to our finding that both

One of our study’s strengths is the use of simulated data, which carries the advantage that the underlying clustering structure is known. It is unclear, however, whether real data, and the impact of problems in real data such as uni- and multivariate outliers and non-normality of residuals, would significantly alter this study’s conclusions. By varying the measurement error in the simulated data this study has attempted to reflect the noise in growth patterns in real data. However, in empirical longitudinal data noise levels may sometimes be larger, or other sources of bias, such as selective drop-out, may influence the results.

The present study did not concern the question of measurement invariance (

The present study was limited to the freely available and open-source R-packages, but besides the R environment there is more software that can be used for longitudinal clustering. For instance, MPLUS (

For this article, the R code used to construct the data sets and to run the Monte Carlo simulations is available via PsychArchives (for access see

The authors have no funding to report.

Peter Verboon is a member of Methodology's Editorial Board, but played no editorial role for this particular article and did not intervene in any form in the peer review procedure.

The authors have no additional (i.e., non-financial) support to report.