Over the past 30 years, structural equation models (SEMs) have become common for analyzing longitudinal data. For instance, growth curve models and latent change models are prominent examples and are special cases of SEMs (for more examples, see e.g., Newsom, 2015). However, less is known regarding the direct connection between SEMs and univariate repeated measures ANOVA (U-RM-ANOVA). More specifically, even though it is well known that both regression and ANOVA are special cases of SEM, to our knowledge, there is no documented approach on how to test main and interaction effects commonly implemented in U-RM-ANOVA using SEM. Although the connection between U-RM-ANOVA and SEM may seem trivial, there are at least three challenges that quickly become apparent while attempting to identify that connection. Specifically, it is not obvious how to (1) impose or test sphericity in SEM, (2) test the main/interaction effects of U-RM-ANOVA (e.g., test the main/interaction effects of, say, a 2 × 3 fully within-subjects repeated measures design), and (3) impose sphericity on latent variables (i.e., perform the U-RM-ANOVA using latent rather than manifest variables). This article tackles these challenges by identifying and demonstrating how to perform each in SEM, and thereby builds on the expansive range of literature on methods for analyzing repeated measures using SEM (e.g., growth curve models, McArdle, 1988; McArdle & Epstein, 1987; Meredith, 1993; latent change models, McArdle, 2009; McArdle & Hamagami, 2001; Raykov, 1999; Steyer et al., 1997; for an overview, see Newsom, 2015).
Furthermore, there are at least four benefits to formally identifying the connection between U-RM-ANOVA and SEM. First, sphericity is often a difficult concept for researchers to grasp, and has a colloquial definition based on the variances of differences between all possible pairs of within-subject conditions (e.g., Field, 1998; Field et al., 2012; Lane, 2016; Nimon, 2012) that only holds in designs with one factor. In contrast, this article demonstrates how sphericity may be specified in SEM, which may help researchers understand its meaning. Second, researchers may also test and/or relax the assumption of sphericity in SEM without having to use the post-hoc adjustments that are common to U-RM-ANOVA (e.g., the Greenhouse-Geisser and Huynh-Feldt adjustments; Greenhouse & Geisser, 1959; Huynh & Feldt, 1970). Third, once in the SEM framework, researchers may capitalize on the other benefits of SEM, such as built-in approaches to handle both missing data and adjustments for non-normality (see the Conclusions and Future Directions section for further detail). Fourth, as will be demonstrated in the article, the SEM framework allows for measurement models, which researchers may want to use in cases wherein manifest variables likely contain measurement error.
Although the analysis of repeated measures via SEM is not new, virtually all of the literature describes analyses in terms of models for repeated measures across time (e.g., growth curves, McArdle, 1988; McArdle & Epstein, 1987; Meredith, 1993; latent change models, McArdle, 2009; McArdle & Hamagami, 2001; Raykov, 1999; Steyer et al., 1997). These analyses are vitally important, but they do not apply to all empirical investigations. For example, many experimental psychologists aim to examine repeated measures across treatment or other experimental conditions, and the repeated measures do not follow a function of time (as assumed in growth curves and latent change models). Hence, this article builds the SEM literature for repeated measures by reconsidering how U-RM-ANOVA (which is not constrained to a specific ordering of time) can be implemented in SEM.
We limit the scope of this article in two ways. First, the article only considers within-subjects designs in order to focus on the sphericity considerations that underlie U-RM-ANOVA, and because mixed designs have been described elsewhere (see Langenberg et al., 2020). Yet, for interested readers, the discussion section briefly characterizes how to include between-subjects factors, and points to other articles that describe this topic in greater detail. Second, because there is extensive literature on these topics, this article assumes that readers have knowledge of the identification of latent variables in the SEM framework, and considerations of measurement invariance to compare means of latent variables across repeated measures. Relevant citations are provided (e.g., Newsom, 2015; Pitts et al., 1996; Widaman et al., 2010).
The remainder of this article includes: (1) a guiding empirical example, (2) a review of orthogonal contrasts (which will be central to testing both sphericity and main/interaction effects in SEM), (3) a review of sphericity, (4) how to impose, test, and relax sphericity in SEM, (5) a simulation study that compares Mauchly’s sphericity test to SEM, (6) how to test main/interaction effects of U-RM-ANOVA in SEM (including how to reproduce F-values from U-RM-ANOVA in SEM), (7) a simulation study that compares RM-ANOVA to SEM with and without assuming sphericity, (8) an extension of SEMs that include measurement models, and (9) some conclusions and future directions. By the article’s close, readers will have a clear understanding of sphericity, how it may be imposed in SEM, and how to test the hypotheses of U-RM-ANOVA in SEM.
Guiding Empirical Example
Here we describe an empirical example to guide readers through the remainder of the article. The example stems from a study that aimed to investigate the development of different processes involved in reading, including both the motor skills necessary for fixating on a sentence and the cognitive skills needed to process the sentence. A total of N = 268 children were asked to read a set of sentences during Grades One, Two, and Four (each child completed three repeated measures; N = 169 children had complete data). Among other variables, the researchers measured mean gaze duration and mean total viewing duration. The variables were log-transformed and standardized.
The comparison (i.e., control) condition used “Landolt sentences” (e.g., Heim et al., 2018; Hillen et al., 2013), which simply replace each character of a regular sentence with a circle. The difference in mean gaze and mean total viewing duration across regular and Landolt sentences measures processing time (i.e., processing time for a regular sentence includes time for both sentence fixation and processing; whereas processing time for Landolt sentences only includes time for sentence fixation).
The experimental design conforms to a 2 × 3 repeated measures design, with sentence type (Factor A: A1 = real sentences; A2 = Landolt sentences) and grades (Factor B: B1 = Grade One; B2 = Grade Two; B3 = Grade Four) as within-subject factors. Figure 1 displays mean gaze duration (solid line) and the mean total viewing duration (dashed line) for the two sentence types across the three measurement occasions. In the next section, we only focus on the dependent variable mean gaze duration. The section Testing Main and Interaction Effects of U-RM-ANOVA Using L-RM-ANOVA uses both dependent variables as indicators of a common latent variable in a measurement model.
Figure 1
Standardized Mean Gaze Duration and Mean Total Viewing Duration for the Two Sentence Types
Note. The solid line indicates standardized mean gaze duration; the dashed line indicates mean total viewing duration for the two sentence types (left panel: regular sentences; right panel: Landolt sentences) across the three measurement occasions. Error bars indicate standard errors.
Review of Orthogonal Contrast Matrices
A matrix of orthogonal contrasts will need to be included in SEM to test both the assumption of sphericity, and the main/interaction effects. This section reviews orthogonal contrast matrices in preparation for their inclusion in SEM later in this article.
For R repeated measures (in the empirical example, R = 6) a set of R − 1 orthogonal contrasts may be constructed to test the main and interaction effects of U-RM-ANOVA, and are usually organized into an (R − 1) × R matrix. For example, if the data from the empirical example are organized into an R dimensional column vector, denoted y, whose elements conform to the order {A1B1, A1B2, A1B3, A2B1, A2B2, A2B3}, then the following orthogonal contrast matrix, C, can be defined as
1
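The contrast scores, denoted $\boldsymbol{\pi}$, are then obtained by premultiplying y with C:

$$\boldsymbol{\pi} = \mathbf{C}\mathbf{y} \tag{2}$$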
There are two necessary criteria for C to be termed an orthogonal contrast matrix. First, each row must sum to zero. Second, the rows (but not the columns) must be orthogonal to one another (i.e., $\mathbf{C}\mathbf{C}^{\top}$ equals an (R − 1) × (R − 1) diagonal matrix, with zeros in the off-diagonal). The construction of orthogonal contrast matrices for repeated measures designs can be complex (especially for designs with two or more factors). Hence, we encourage readers to use statistical software, such as R (R Core Team, 2021), to construct an orthogonal contrast matrix based on a specific design (see Appendix A for R code on how to construct orthogonal contrast matrices; or see UCLA Statistical Consulting Group, 2011, for an overview of different types of orthogonal contrast matrices, along with code to produce those matrices).
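The two criteria can be checked numerically. The following is a minimal sketch in plain Python, not the article's Appendix A code: the contrast codes (a difference code for A and Helmert-type codes for B) and the helper names `kron` and `gram` are assumptions chosen for illustration, and the matrix is not necessarily the C of Equation 1.

```python
def kron(a, b):
    """Kronecker product of two matrices stored as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def gram(m):
    """Return M M^T for a matrix stored as a list of rows."""
    return [[sum(x * y for x, y in zip(r1, r2)) for r2 in m] for r1 in m]

# Hypothetical unscaled codes (assumption, not taken from the article)
a_mean, a_diff = [[1, 1]], [[1, -1]]                      # factor A: 2 levels
b_mean, b_diff = [[1, 1, 1]], [[1, -1, 0], [1, 1, -2]]    # factor B: 3 levels

# Columns follow the order A1B1, A1B2, A1B3, A2B1, A2B2, A2B3
C = kron(a_diff, b_mean) + kron(a_mean, b_diff) + kron(a_diff, b_diff)

# Criterion 1: every row sums to zero
rows_sum_to_zero = all(sum(row) == 0 for row in C)

# Criterion 2: C C^T is diagonal (all off-diagonal elements are zero)
CCt = gram(C)
rows_orthogonal = all(CCt[i][j] == 0
                      for i in range(len(C)) for j in range(len(C)) if i != j)
```

Row 1 of this matrix codes the main effect of A, rows 2 and 3 the main effect of B, and rows 4 and 5 the interaction. Because the diagonal of `CCt` is not all ones, the matrix is orthogonal but not orthonormal.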
A specific type of orthogonal contrast matrix, termed an orthonormal contrast matrix, is of special interest in the context of U-RM-ANOVA. Orthonormal contrast matrices are orthogonal contrast matrices (i.e., orthonormal matrices satisfy the two criteria of orthogonal matrices), whose sum of squared row elements equals 1 (i.e., after squaring each value in the matrix, the sum across each row equals 1). Consequently, the C matrix defined in Equation 1 is an orthonormal contrast matrix. The small benefit of the orthonormal matrix (which will be demonstrated later in this article) is that the sums of squares, mean squares, and F-values of U-RM-ANOVA may be exactly replicated in SEM with the use of an orthonormal contrast matrix (as opposed to an orthogonal contrast matrix, which only directly reproduces F-values, see Voelkle, 2007). As an aside, we note that if some contrast matrix C is orthogonal but not orthonormal, then the rows of C may be scaled to make a new orthonormal matrix (this may be useful for readers using statistical software that can produce orthogonal contrasts, but not orthonormal contrasts, e.g., we rescaled an orthogonal contrast matrix produced by the statistical software R to obtain C in Equation 1). Appendix B demonstrates how to rescale the rows of an orthogonal matrix to create an orthonormal matrix.
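As a rough illustration of the rescaling step (a sketch only, not the Appendix B code), each row of an orthogonal contrast matrix can be divided by its Euclidean length. The matrix `C_orthogonal` and the helper `normalize_rows` are hypothetical names introduced here.

```python
import math

def normalize_rows(m):
    """Divide each row by its Euclidean length so its squared entries sum to 1."""
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row] for row in m]

# Hypothetical orthogonal (but not orthonormal) contrasts for a 2 x 3 design
C_orthogonal = [
    [1, 1, 1, -1, -1, -1],   # main effect of A
    [1, -1, 0, 1, -1, 0],    # main effect of B, first contrast
    [1, 1, -2, 1, 1, -2],    # main effect of B, second contrast
    [1, -1, 0, -1, 1, 0],    # A x B, first contrast
    [1, 1, -2, -1, -1, 2],   # A x B, second contrast
]

C_orthonormal = normalize_rows(C_orthogonal)

# After rescaling, the sum of squared elements in every row equals 1
row_sums_of_squares = [sum(x * x for x in row) for row in C_orthonormal]
```

Rescaling changes only the length of each row, so the rows still sum to zero and remain mutually orthogonal.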
Taken together, this section alludes to how an orthogonal contrast matrix may be used to evaluate the null hypotheses of main and interaction effects from U-RM-ANOVA (i.e., by forming the contrast scores $\boldsymbol{\pi} = \mathbf{C}\mathbf{y}$, and testing whether specific means of $\boldsymbol{\pi}$ differ significantly from zero). Later in this article we will use this information to estimate $\boldsymbol{\pi}$ from y in SEM, and use the tools of SEM to perform the significance tests that reflect main and interaction effects.
Review of Sphericity
Univariate repeated measures ANOVA assumes that the variance covariance matrix of repeated measures conforms to a specific pattern, commonly referred to as “sphericity” or a “spherical matrix”. Therefore, using SEM to test main and interaction effects in the same manner as U-RM-ANOVA requires sphericity to be imposed in SEM, and this section reviews the definition of sphericity in preparation for its inclusion in SEM.
Importantly, the colloquial definitions of sphericity provided in applied statistics textbooks for psychology researchers are often simplifications that do not readily generalize to U-RM-ANOVA designs with two or more within subjects factors. In response, we review the colloquial definitions, describe their potential for misunderstanding, describe the actual definition as originally provided in the statistics literature (Huynh & Feldt, 1970, p. 1587, Theorem 3), and clarify that definition for U-RM-ANOVA designs with more than one within-subjects factor.
Colloquial Definition of Sphericity
In the psychology literature, the colloquial definition of sphericity states that the variances of all pairwise differences between repeated measures are equal (see for example, Field, 1998; Field et al., 2012, pp. 550–552; Lane, 2016; Nimon, 2012). For example, in a one-way U-RM-ANOVA with three levels (i.e., A1, A2, and A3; not the same as this article’s guiding empirical example), the common definition of sphericity prescribes
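$$\operatorname{Var}(A_1 - A_2) = \operatorname{Var}(A_1 - A_3) = \operatorname{Var}(A_2 - A_3) \tag{3}$$

One confusing aspect of this definition is that it invites over-generalization to designs with more than one within-subjects factor. For example, researchers may infer that sphericity for the 2 × 3 design from the empirical example implies that the 15 unique pairwise differences across the 6 repeated measures must have equal variances, such that

$$\operatorname{Var}(A_1B_1 - A_1B_2) = \operatorname{Var}(A_1B_1 - A_1B_3) = \dots = \operatorname{Var}(A_2B_2 - A_2B_3) \tag{4}$$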
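To make the colloquial definition concrete, the variance of a pairwise difference follows from the covariance matrix as Var(Y_i − Y_j) = σ_ii + σ_jj − 2σ_ij. The sketch below applies this to a hypothetical compound-symmetric covariance matrix for six repeated measures; the numbers are invented for illustration and are not estimates from the guiding example.

```python
from itertools import combinations

def pairwise_difference_variances(cov):
    """Var(Y_i - Y_j) = var_i + var_j - 2 * cov_ij for every pair i < j."""
    return {(i, j): cov[i][i] + cov[j][j] - 2 * cov[i][j]
            for i, j in combinations(range(len(cov)), 2)}

# Hypothetical compound-symmetric matrix: variance 2.0, covariance 0.5
R = 6
cov = [[2.0 if i == j else 0.5 for j in range(R)] for i in range(R)]

diff_vars = pairwise_difference_variances(cov)
n_pairs = len(diff_vars)                      # number of unique pairs
all_equal = len(set(diff_vars.values())) == 1
```

With six repeated measures there are 6 × 5 / 2 = 15 unique pairs, and under compound symmetry all of them share the same difference variance (here 2.0 + 2.0 − 2 × 0.5 = 3.0).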
A second confusing aspect of the colloquial definition of sphericity concerns the separate tests of sphericity provided by statistical software for each main or interaction effect (i.e., each main and interaction effect receive their own test of sphericity unless the effect has one degree of freedom; see the section Sphericity for Main and Interaction Effects of U-RM-ANOVA for details). In particular, if the colloquial definition of sphericity were true, then only one test should be needed regardless of the within-subjects design (i.e., the definition refers to pairwise differences rather than main/interaction effects). Therefore, the outputs from statistical software implementing U-RM-ANOVA do not corroborate the colloquial definition.
We want to briefly mention that, in fact, an omnibus test can be constructed to test for sphericity in multiple effects simultaneously. This omnibus test can decrease the Type I error rate of incorrectly rejecting the assumption that sphericity holds. This test will be described in the section Testing Sphericity Using L-RM-ANOVA.
Separate from the colloquial definition, psychology texts often describe compound symmetry as a special case of sphericity, and then describe assumptions in terms of compound symmetry (Field, 1998; Haverkamp & Beauducel, 2017; Maxwell & Delaney, 2004). Even though these texts do not claim equality across compound symmetry and sphericity, the compound symmetry simplification also does not easily generalize to higher-order within-subjects designs.
Original Definition of Sphericity
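Huynh and Feldt (1970, p. 1587, Theorem 3) provide the original definition of sphericity. In general (i.e., not specific to the guiding empirical example), a P × P matrix $\boldsymbol{\Sigma}$ (e.g., a variance covariance matrix) conforms to sphericity if, and only if,

$$\mathbf{C}\boldsymbol{\Sigma}\mathbf{C}^{\top} = \lambda\,\mathbf{I}_{P-1} \tag{5}$$

where C is a (P − 1) × P orthonormal contrast matrix, $\lambda$ is a positive constant, and $\mathbf{I}_{P-1}$ is the identity matrix of order P − 1.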
However, the definition of sphericity requires more specificity for higher-order U-RM-ANOVAs (i.e., the definition by Huynh & Feldt, 1970, was given for a single repeated measures factor, not for within-subjects designs with more than one repeated measures factor). Similarly, multivariate statistics textbooks describing U-RM-ANOVA often provide the original definition by Huynh and Feldt (1970) without explaining how to extend the definition to obtain separate tests for each main/interaction effect (for example, see Stevens, 2002, p. 421). Hence, we explain the original (1970) definition using the variance covariance matrix of the contrast data (i.e., $\boldsymbol{\Sigma}_{\pi}$, the variance covariance matrix of $\boldsymbol{\pi} = \mathbf{C}\mathbf{y}$) from the empirical example to show how to generalize the definition to multi-factorial designs.
Sphericity for Main and Interaction Effects of U-RM-ANOVA
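Following the empirical example, let $\boldsymbol{\Sigma}_{y}$ refer to the R × R variance covariance matrix of y, and $\boldsymbol{\Sigma}_{\pi}$ refer to the (R − 1) × (R − 1) variance covariance matrix of the contrasts $\boldsymbol{\pi} = \mathbf{C}\mathbf{y}$. Then, adhering to the definitions for C given in Equation 1, it follows that

$$\boldsymbol{\Sigma}_{\pi} = \mathbf{C}\,\boldsymbol{\Sigma}_{y}\,\mathbf{C}^{\top} \tag{6}$$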
7
Thus, the general definition for sphericity in U-RM-ANOVA refers to $\boldsymbol{\Sigma}_{\pi}$, the variance covariance matrix of the contrasts (as opposed to variances of all pairwise differences). Once the set of orthogonal contrasts for the main/interaction effects of a specific U-RM-ANOVA design is identified and organized into an (R − 1) × R matrix C, sphericity holds for each effect if, and only if, the variances in $\boldsymbol{\Sigma}_{\pi}$ that reflect the effect are equal, and the covariances in $\boldsymbol{\Sigma}_{\pi}$ that reflect the effect equal zero.
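The effect-wise definition can be sketched numerically. The code below transforms a covariance matrix of the repeated measures into the covariance matrix of the contrasts (C Σ Cᵀ) and checks, per effect, whether the relevant variances are equal and the relevant covariances are zero. The contrast codes, both covariance matrices, and the helper names are hypothetical; this is not the semnova or lavaan implementation.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize_rows(m):
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row] for row in m]

def is_spherical(cov, idx, tol=1e-9):
    """Equal variances and zero covariances among the contrasts in idx."""
    variances = [cov[i][i] for i in idx]
    equal_var = max(variances) - min(variances) < tol
    zero_cov = all(abs(cov[i][j]) < tol for i in idx for j in idx if i != j)
    return equal_var and zero_cov

# Hypothetical orthonormal contrasts (rows: A, B, B, A x B, A x B)
C = normalize_rows([
    [1, 1, 1, -1, -1, -1],
    [1, -1, 0, 1, -1, 0],
    [1, 1, -2, 1, 1, -2],
    [1, -1, 0, -1, 1, 0],
    [1, 1, -2, -1, -1, 2],
])
effects = {"A": [0], "B": [1, 2], "A x B": [3, 4]}

def sphericity_by_effect(sigma_y):
    sigma_pi = matmul(matmul(C, sigma_y), transpose(C))   # C Sigma_y C^T
    return {name: is_spherical(sigma_pi, idx) for name, idx in effects.items()}

# Compound symmetry (a special case of sphericity) versus one inflated variance
sigma_cs = [[2.0 if i == j else 0.5 for j in range(6)] for i in range(6)]
sigma_het = [row[:] for row in sigma_cs]
sigma_het[2][2] += 3.0            # larger variance for condition A1B3

result_cs = sphericity_by_effect(sigma_cs)
result_het = sphericity_by_effect(sigma_het)
```

An effect with a single contrast (here A) passes the check trivially, which mirrors the statement that sphericity is not relevant for effects with one degree of freedom.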
Testing Sphericity, Main Effects, and Interaction Effects in SEM
In the following sub-sections, we demonstrate how to test sphericity as well as main and interaction effects with and without assuming sphericity. We use the open-source R package lavaan (Rosseel, 2012) to estimate the models. We furthermore provide the open-source R package semnova (Langenberg & Mayer, 2020) which is an interface to lavaan and includes user-friendly functions to perform the proposed tests.1
Latent Repeated Measures ANOVA (L-RM-ANOVA)
The SEM framework can be used to test sphericity because SEMs can (1) estimate $\boldsymbol{\Sigma}_{\pi}$ (the variance covariance matrix of the contrasts) from $\boldsymbol{\Sigma}_{y}$ (the variance covariance matrix of the repeated measures), (2) place constraints on the estimated elements in $\boldsymbol{\Sigma}_{\pi}$ that conform to sphericity, and (3) perform likelihood ratio tests to quantify statistical significance (analogous to Mauchly’s test). This section characterizes estimation of $\boldsymbol{\Sigma}_{\pi}$ from $\boldsymbol{\Sigma}_{y}$ (because the latter two points are commonplace), as a special case of latent repeated measures analysis of variance (L-RM-ANOVA, Langenberg et al., 2020; an extension of the growth components approach given by Mayer et al., 2012). L-RM-ANOVA estimates individual contrasts (i.e., $\boldsymbol{\pi}$) as a set of latent variables in SEM. Stated differently, L-RM-ANOVA identifies how to rewrite Equation 2 as an SEM.
To clarify, recall that the SEM measurement model for P manifest variables and Q latent variables may be written as
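$$\mathbf{y} = \boldsymbol{\nu} + \boldsymbol{\Lambda}\boldsymbol{\eta} + \boldsymbol{\varepsilon} \tag{8}$$

where $\boldsymbol{\nu}$ is a P × 1 vector of intercepts, $\boldsymbol{\Lambda}$ is a P × Q matrix of loadings, $\boldsymbol{\eta}$ is a Q × 1 vector of latent variables, and $\boldsymbol{\varepsilon}$ is a P × 1 vector of residuals. In particular, L-RM-ANOVA sets P = Q, $\boldsymbol{\nu} = \mathbf{0}$, and $\boldsymbol{\varepsilon} = \mathbf{0}$ (i.e., there are as many latent variables as manifest variables, and the latent variables fully account for the means, variances, and covariances of the manifest variables), such that

$$\mathbf{y} = \boldsymbol{\Lambda}\boldsymbol{\eta} \tag{9}$$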
Equation 9 is similar to Equation 2 (i.e., $\boldsymbol{\pi} = \mathbf{C}\mathbf{y}$), when conceptualizing the contrasts $\boldsymbol{\pi}$ as a set of latent variables (i.e., $\boldsymbol{\eta}$), and C as the basis for a matrix of loadings ($\boldsymbol{\Lambda}$). Conceptually, multiplying both sides of Equation 2 by the inverse of C should yield the same form as the SEM measurement model in Equation 9 (i.e., $\mathbf{y} = \mathbf{C}^{-1}\boldsymbol{\pi}$). However, C cannot be inverted because it is not a square matrix (currently (R − 1) × R). L-RM-ANOVA augments C by concatenating a row that contains a single constant (e.g., $1/\sqrt{R}$, which keeps the augmented matrix orthonormal) to the top of the matrix, creating an invertible matrix that maintains orthogonality. Following the guiding example,
10
11
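The augmentation step can be illustrated with a small sketch. It assumes a hypothetical orthonormal contrast matrix and a constant row of 1/√R; under those assumptions the augmented matrix is orthonormal, so its inverse is simply its transpose and can serve as a loading matrix. The data vector `y` and all helper names are invented for illustration.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(m):
    return [list(col) for col in zip(*m)]

def normalize_rows(m):
    return [[x / math.sqrt(sum(v * v for v in row)) for x in row] for row in m]

R = 6
# Hypothetical orthonormal contrasts (assumption, not the article's Equation 1)
C = normalize_rows([
    [1, 1, 1, -1, -1, -1],
    [1, -1, 0, 1, -1, 0],
    [1, 1, -2, 1, 1, -2],
    [1, -1, 0, -1, 1, 0],
    [1, 1, -2, -1, -1, 2],
])

# Add a constant row on top so the matrix becomes square (6 x 6)
C_aug = [[1 / math.sqrt(R)] * R] + C

# For an orthonormal square matrix the inverse equals the transpose
loadings = transpose(C_aug)
identity_check = matmul(C_aug, loadings)
max_error = max(abs(identity_check[i][j] - (1.0 if i == j else 0.0))
                for i in range(R) for j in range(R))

# Round trip for one hypothetical observation: y -> contrasts -> y
y = [[3.0], [1.0], [4.0], [1.0], [5.0], [9.0]]
eta = matmul(C_aug, y)            # first element is the scaled mean of y
y_back = matmul(loadings, eta)
round_trip_error = max(abs(a[0] - b[0]) for a, b in zip(y, y_back))
```

With a merely orthogonal (not orthonormal) matrix the transpose shortcut no longer applies and a general matrix inverse would be needed instead.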
Figure 2
Path Diagram of the SEM Implementing a 2 × 3 (Sentence Type × Grade) Repeated Measures Design Using an Orthonormal Contrast Matrix
Note. Rectangles represent the manifest variables y and the circles represent the latent contrast variables $\boldsymbol{\pi}$. The weights of the arrows going from $\boldsymbol{\pi}$ to y can be found in the $\boldsymbol{\Lambda}$ matrix. Intercepts and residual (co)variances of the manifest variables y are set to zero. Intercepts and (co)variances of the contrast variables are freely estimated.
Testing Sphericity Using L-RM-ANOVA
Sphericity for a given main/interaction effect may be tested by performing a $\chi^2$-difference test across a model that assumes sphericity for the effect versus a model that does not. Following the guiding example, sphericity for the main effect of B and the interaction may be tested (sphericity is not relevant for the main effect of A because it only has one degree of freedom; see the section Sphericity for Main and Interaction Effects of U-RM-ANOVA). In what follows, $\sigma_k^2$ denotes the variance of the kth contrast and $\sigma_{kl}$ the covariance between contrasts k and l, with the contrasts ordered so that contrast 1 belongs to A, contrasts 2 and 3 to B, and contrasts 4 and 5 to A × B. Sphericity for the main effect of B prescribes that both $\sigma_2^2 = \sigma_3^2$ and $\sigma_{32} = 0$; whereas sphericity for the interaction effect prescribes $\sigma_4^2 = \sigma_5^2$ and $\sigma_{54} = 0$ (see the section Sphericity for Main and Interaction Effects of U-RM-ANOVA for details). The constrained covariance matrices for the two effects are given by:
12
Table 1
Mauchly’s Test and χ²-Difference Test for Sphericity
Effect | Mauchly’s test: χ² | df | p | χ²-difference test: Δχ² | df | p |
---|---|---|---|---|---|---|
B | 69.81 | 2 | <.001 | 70.65 | 2 | <.001 |
A × B | 72.41 | 2 | <.001 | 73.28 | 2 | <.001 |
B + A × B | | | | 112.51 | 4 | <.001 |
As mentioned earlier, it is also possible to test both sphericity assumptions simultaneously. That is, we can perform an omnibus test of sphericity for the main effect of B and the interaction effect of A and B at the same time. This omnibus test can decrease the Type I error rate that can arise due to multiple testing. The results for the omnibus test can also be found in Table 1.
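For readers who want to see how a χ²-difference value maps to a p-value, the sketch below evaluates the upper tail of the χ² distribution in closed form, which is available whenever the degrees of freedom are even (as they are for the tests in Table 1). The function name is hypothetical; in practice, SEM software reports these p-values directly.

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variate for even df.

    For df = 2m: P(X > x) = exp(-x/2) * sum_{k=0}^{m-1} (x/2)^k / k!
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

# Chi-square-difference values and df as reported in Table 1
p_B = chi2_sf_even_df(70.65, 2)
p_AxB = chi2_sf_even_df(73.28, 2)
p_omnibus = chi2_sf_even_df(112.51, 4)

# Sanity check against the familiar 5% critical value for df = 2
p_at_critical_value = chi2_sf_even_df(5.991, 2)
```

All three p-values fall far below .001, in line with the entries in Table 1.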
Simulation 1: Comparing Mauchly’s Sphericity Test to the SEM Based Test
In the previous sections, we revisited the original definition of sphericity and showed that SEM can be used to test for sphericity in multi-factorial experimental designs. We further compared Mauchly’s Sphericity Test to the SEM based test using an empirical example, where both tests showed very similar results. It is, however, important to know the statistical properties across different settings, for instance, sample sizes and degrees of departure from sphericity. In this section, we will conduct a small-scale simulation study to compare both tests. The aim of this study is to guide applied researchers to decide in which situation which test is to be preferred.
Method
We generated data for a 2 × 3 repeated measures design following the model in Figure 2. We investigated the Type 1 error and power of Mauchly’s Test and the SEM based test to detect departures from sphericity for the main effect of Factor B, which has three levels (i.e., the effect consists of two contrasts). For the data generation, we set the means of all contrasts to zero and the variances to one. We manipulated the degree of departure from sphericity in terms of Mauchly’s W, where W = 1 indicates no departure and W = 0 indicates the largest possible departure. We used four values of W, each of which implies a certain value for the covariances between the contrast variables ( , , , ). All covariances were set to this value, although we would only have had to manipulate the covariance between the two contrasts of B and the covariance between the two contrasts of A × B in order to impose different degrees of departure from sphericity on the main effect of B and the interaction effect of A and B (i.e., the other covariances are not important and manipulating them does no harm). The four conditions were chosen to cover the full range of possible values of W. We further manipulated the sample size (N = 30, 40, 50, 60, 70, 80, 90, 100). The smallest sample size was chosen because it is close to N = 27, which is the smallest sample size required to estimate the model (i.e., we estimate 6 means, 6 variances, and 15 covariances). The larger sample sizes were chosen to give a clear picture of when the tests reach a power of at least .8. An overview of all conditions is shown in Table 4. We used 1,000 replications for each of the aforementioned conditions. The simulation was performed using the statistical software R (R Core Team, 2021) in combination with the car package (Fox & Weisberg, 2019) for Mauchly’s test and Mplus (L.K. Muthén & Muthén, 2017) for the SEM based test. Finally, we chose an alpha level of .05 for both tests.
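The link between Mauchly’s W and the covariance between two contrasts can be sketched as follows. The code uses the standard definition W = det(S) / (tr(S) / p)^p for a p × p contrast covariance matrix S, specialized to p = 2. With unit variances this reduces to W = 1 − c², where c is the covariance. The W values in the loop are hypothetical placeholders and are not the four values used in the simulation.

```python
import math

def mauchly_w_2x2(var1, var2, cov):
    """Mauchly's W for two contrasts: det(S) / (trace(S) / 2) ** 2."""
    det = var1 * var2 - cov ** 2
    mean_var = (var1 + var2) / 2.0
    return det / mean_var ** 2

def covariance_for_w(w):
    """Covariance between two unit-variance contrasts that yields a given W."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("W must lie in [0, 1]")
    return math.sqrt(1.0 - w)

# Hypothetical W values (assumption): map each to its implied covariance
implied = {w: covariance_for_w(w) for w in (1.0, 0.75, 0.5, 0.25)}

# Number of freely estimated parameters for R = 6 repeated measures:
# R means, R variances, and R * (R - 1) / 2 covariances
R = 6
n_parameters = R + R + R * (R - 1) // 2
```

The parameter count of 27 matches the 6 means, 6 variances, and 15 covariances mentioned in the text.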
Results and Discussion
The simulation results are shown in Figure 3. The leftmost tile of the figure shows the Type 1 error for both tests when sphericity is not violated. Mauchly’s Test as well as the SEM based test show a Type 1 error around the desired 5%. Neither of the tests seems to have an overly inflated Type 1 error. The other tiles show the statistical power of both tests. We can see that the power increases with increasing sample size and increasing departure from sphericity. Both tests show very similar power across all conditions, suggesting that SEM is a viable alternative to Mauchly’s Test.
Figure 3
Power and Type 1 Error for Mauchly’s Sphericity Test and the SEM Based Test as a Function of Sample Size N and Degree of Departure From Sphericity (Mauchly’s W)
Note. The first grayed tile (W = 1) shows the Type 1 error. The other tiles show the power.
Testing Main and Interaction Effects of U-RM-ANOVA Using L-RM-ANOVA
A given main or interaction effect may be tested by performing a $\chi^2$-difference test across two models: a constrained model in which the means belonging to a particular effect are fixed to zero (i.e., conforming to the null hypothesis), and a second unconstrained model in which the means are freely estimated (i.e., conforming to the alternative hypothesis). But, to adhere to the assumptions of U-RM-ANOVA, both models used in the $\chi^2$-difference test must impose sphericity.
Following the guiding example, let $\mu_k$ denote the mean of the kth latent contrast, with contrast 1 belonging to A, contrasts 2 and 3 to B, and contrasts 4 and 5 to A × B. The null hypothesis for the main effect of A prescribes $\mu_1 = 0$; the null hypothesis for the main effect of B prescribes both $\mu_2 = 0$ and $\mu_3 = 0$; and the null hypothesis for the interaction effect prescribes both $\mu_4 = 0$ and $\mu_5 = 0$ (see the section Review of Orthogonal Contrast Matrices for details). Therefore, testing the main effect of A compares a model that constrains $\mu_1 = 0$ versus a model that freely estimates $\mu_1$ (without any extra constraints for the sphericity assumption). Testing the main effect of B compares a model that constrains $\mu_2 = 0$ and $\mu_3 = 0$ versus a model that freely estimates $\mu_2$ and $\mu_3$ (while both models impose sphericity for the main effect of B). Testing the interaction effect compares a model that constrains $\mu_4 = 0$ and $\mu_5 = 0$ versus a model that freely estimates $\mu_4$ and $\mu_5$ (while both models impose sphericity for the interaction effect).
Table 2 displays the relevant test statistics, and their p-values, for the main and interaction effects from U-RM-ANOVA and L-RM-ANOVA. Importantly, the product of an F-value and its numerator degrees of freedom (i.e., $df_1 \times F$) is asymptotically equivalent to the $\chi^2$-difference value from L-RM-ANOVA (i.e., $\Delta\chi^2 \approx df_1 \times F$; e.g., Fahrmeir et al., 2013; Kohler, 1982; Lu & Zhang, 2010); and the approximate F-values are given from L-RM-ANOVA using that transformation.
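The transformation amounts to dividing each χ²-difference value by the numerator degrees of freedom. A minimal sketch, using the values reported for the sphericity-constrained L-RM-ANOVA in Table 2 (the function and dictionary names are hypothetical):

```python
def approx_f(delta_chi2, df1):
    """Approximate F-value implied by a chi-square-difference value."""
    if df1 <= 0:
        raise ValueError("df1 must be positive")
    return delta_chi2 / df1

# (chi-square difference, df1) for the sphericity-constrained L-RM-ANOVA
table2_sphericity = {
    "B": (566.14, 2),
    "A x B": (429.04, 2),
}

approx_f_values = {effect: round(approx_f(chi2, df1), 2)
                   for effect, (chi2, df1) in table2_sphericity.items()}
```

This reproduces the approximate F-values of 283.07 and 214.52 listed for B and A × B.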
Table 2
Results for Main and Interaction Effects
Effect | U-RM-ANOVA: F | df1 | df2 | p | pGG | pHF | L-RM-ANOVA Sphericity: Δχ² | df1 | ≈ F | p | L-RM-ANOVA: Δχ² | df1 | ≈ F | p |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
A | 371.90 | 1 | 167 | <.001 | | | 196.82 | 1 | 196.82 | <.001 | 196.82 | 1 | 198.87 | <.001 |
B | 733.48 | 2 | 334 | <.001 | <.001 | <.001 | 566.14 | 2 | 283.07 | <.001 | 318.45 | 2 | 159.22 | <.001 |
A × B | 431.78 | 2 | 334 | <.001 | <.001 | <.001 | 429.04 | 2 | 214.52 | <.001 | 249.59 | 2 | 124.79 | <.001 |
Note. GG = Greenhouse-Geisser corrected p-value. HF = Huynh-Feldt corrected p-value.
The test statistics and p-values differ across U-RM-ANOVA and L-RM-ANOVA (as opposed to the tests of sphericity from the previous sub-section). The difference across the approaches arises because U-RM-ANOVA bases its test statistics (and p-values) on F-ratios (and F-distributions), whereas L-RM-ANOVA uses $\chi^2$-differences (and compares those values to $\chi^2$-distributions). Therefore, discrepancies may arise across the approaches, even though they are designed to test the same hypotheses.
To narrow the gap, the following sub-section (after the excursus) shows how to reproduce both the sums of squares and the F-values from U-RM-ANOVA using the parameter estimates of L-RM-ANOVA. Furthermore, p-values from U-RM-ANOVA may then be more closely reproduced using L-RM-ANOVA by comparing the reproduced F-values to an F distribution.
Excursus: Interpreting Main Effects in the Presence of Interaction Effects
Although not the main focus of this article, we would like to briefly pick up on the discussion about interpreting main effects in the presence of interaction effects. We find it important to note that point estimates of the main effect of sentence type should be interpreted with care. That is, the estimate of the average difference in gaze duration between regular and Landolt sentences across grades is the unweighted average of the conditional effects of sentence type on gaze duration given different grades. This may not be the effect that researchers are interested in. As many researchers have argued in the past, the effect of an independent variable (sentence type) on a dependent variable (gaze duration) is dependent on the moderator (grade) in the presence of an interaction effect (e.g., Aguinis, 2004; Aguinis et al., 2016; Aiken & West, 1991; Busenbark et al., 2021; Cohen et al., 2003). This phenomenon can also be observed in Figure 1. An alternative approach may be to express the effect of the sentence type on gaze duration as a function of the grade and to look at the conditional effects. This enables us to look at the difference between regular and Landolt sentences in different grades separately. Still, we believe that aggregated or average effects can add valuable additional information in some contexts (see also the discussion in Gräfe et al., 2022).
Calculating Sums of Squares, Mean Squares, and F-Ratios Using L-RM-ANOVA
As noted earlier in this article (section Review of Orthogonal Contrast Matrices), orthonormal contrast matrices provide the opportunity to reproduce the sums of squares, mean squares, and F-ratios for each effect from U-RM-ANOVA (Voelkle, 2007). This sub-section shows how to reproduce each of these components using L-RM-ANOVA, and Table 3 shows the estimates across U-RM-ANOVA and L-RM-ANOVA. Importantly, all sums of squares, mean squares, and F-ratios are formed from an L-RM-ANOVA that both creates latent contrasts using an orthonormal matrix, and imposes sphericity. The general formulas to calculate the sums of squares are:
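$$SS_{\text{effect}} = N \sum_{k \in \text{effect}} \hat{\mu}_k^2 \tag{13}$$

$$RSS_{\text{effect}} = N \sum_{k \in \text{effect}} \hat{\sigma}_k^2 \tag{14}$$

$$MS_{\text{effect}} = \frac{SS_{\text{effect}}}{df_1} \tag{15}$$

$$MSR_{\text{effect}} = \frac{RSS_{\text{effect}}}{(N - 1)\, df_1} \tag{16}$$

$$F_{\text{effect}} = \frac{MS_{\text{effect}}}{MSR_{\text{effect}}} \tag{17}$$

where the sums run over the latent contrasts k that belong to the effect, $\hat{\mu}_k$ and $\hat{\sigma}_k^2$ are the estimated mean and variance of contrast k, and $df_1$ is the number of contrasts that underlie the effect.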
Table 3
Sums of Squares
Effect | U-RM-ANOVA: SS | RSS | MS | MSR | F | L-RM-ANOVA: SS | RSS | MS | MSR | F |
---|---|---|---|---|---|---|---|---|---|---|
A | 184.09 | 82.67 | 184.09 | 0.50 | 371.90 | 184.09 | 82.67 | 184.09 | 0.50 | 371.90 |
B | 210.45 | 47.92 | 105.23 | 0.14 | 733.48 | 210.45 | 47.92 | 105.23 | 0.14 | 733.48 |
A × B | 87.28 | 33.76 | 43.64 | 0.20 | 431.78 | 87.28 | 33.76 | 43.64 | 0.20 | 431.78 |
The sums of squares for a given effect from U-RM-ANOVA (written as, e.g., $SS_A$, $SS_B$, or $SS_{A \times B}$) may be reproduced in L-RM-ANOVA by summing across the squared means of the effect’s latent contrasts, and multiplying the sum by the sample size (Equation 13). Following the guiding example, $SS_A = 184.09$; $SS_B = 210.45$; and $SS_{A \times B} = 87.28$ (N = 169 in the guiding example).
The residual sums of squares for a given effect from U-RM-ANOVA (written as, e.g., $RSS_A$, $RSS_B$, or $RSS_{A \times B}$) may be reproduced in L-RM-ANOVA by summing across the variances of the effect’s latent contrasts, and multiplying the sum by the sample size (Equation 14). Following the guiding example, $RSS_A = 82.67$, $RSS_B = 47.92$, and $RSS_{A \times B} = 33.76$ (again, N = 169 in the guiding example).
The mean squares for a given effect from U-RM-ANOVA (written as, e.g., $MS_A$, $MS_B$, or $MS_{A \times B}$) may be reproduced in L-RM-ANOVA by dividing the effect’s sums of squares by its numerator degrees of freedom (i.e., the number of contrasts that underlie the effect, Equation 15), such that $MS_A = SS_A / 1$; $MS_B = SS_B / 2$; and $MS_{A \times B} = SS_{A \times B} / 2$. In the guiding example, $MS_A = 184.09$, $MS_B = 105.23$, and $MS_{A \times B} = 43.64$.
The mean squares of the residuals for a given effect from U-RM-ANOVA (written as, e.g., $MSR_A$, $MSR_B$, or $MSR_{A \times B}$) may be reproduced in L-RM-ANOVA by dividing each effect’s residual sums of squares by the product of N − 1 and the effect’s numerator degrees of freedom (Equation 16). Following the guiding example, we have $MSR_A = 0.50$; $MSR_B = 0.14$; and $MSR_{A \times B} = 0.20$ (values as reported in Table 3).
F-values from U-RM-ANOVA may be reproduced in L-RM-ANOVA by dividing the effect’s mean squares by its mean squared residuals (Equation 17). Following the guiding example leads to $F_A = 371.90$, $F_B = 733.48$, and $F_{A \times B} = 431.78$. As shown in Table 3, all values are virtually identical across U-RM-ANOVA and L-RM-ANOVA; and therefore researchers may compute F and p-values that closely match those from U-RM-ANOVA using L-RM-ANOVA.
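The chain from sums of squares to F-values can be retraced with the rounded entries of Table 3. This is a sketch under one stated assumption: the residual degrees of freedom are taken as N − 1 = 167, the df2 value reported for effect A in Table 2. Because the inputs are rounded to two decimals, the recomputed F-values agree with the table only approximately. The function and variable names are hypothetical.

```python
def f_ratio(ss, rss, df1, n_minus_1):
    """Mean square, residual mean square, and F from SS and RSS."""
    ms = ss / df1                      # mean square: SS divided by df1
    msr = rss / (n_minus_1 * df1)      # residual mean square
    return ms, msr, ms / msr

# Rounded SS and RSS from Table 3, with numerator df per effect
table3 = {
    "A": {"ss": 184.09, "rss": 82.67, "df1": 1},
    "B": {"ss": 210.45, "rss": 47.92, "df1": 2},
    "A x B": {"ss": 87.28, "rss": 33.76, "df1": 2},
}

N_MINUS_1 = 167   # assumption: taken from df2 for effect A in Table 2

f_values = {effect: f_ratio(v["ss"], v["rss"], v["df1"], N_MINUS_1)[2]
            for effect, v in table3.items()}
```

The resulting F-values land within rounding distance of the 371.90, 733.48, and 431.78 reported in Table 3.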
Finally, we would like to note that a major advantage of SEM is that sums of squares can also be calculated in the presence of missing values. Parameter estimates can be obtained through full information maximum likelihood, which are then used to calculate sums of squares as shown above. The above calculations can further be extended to mixed within- and between-subjects designs with any number of factors. We refer the interested reader to Langenberg et al. (2022), which includes instructions in the appendix to calculate F-values and the effect size measure for larger within- and between-subjects designs.
Testing Main and Interaction Effects Without the Assumption of Sphericity Using L-RM-ANOVA
In contrast to U-RM-ANOVA, L-RM-ANOVA may test main/interaction effects without the assumption of sphericity. The model comparison procedure described above (i.e., the $\chi^2$-difference tests described in the section Testing Main and Interaction Effects of U-RM-ANOVA Using L-RM-ANOVA) can relax sphericity by estimating all elements of the contrast covariance matrix $\boldsymbol{\Sigma}_{\pi}$ in both models (i.e., removing the constraints that describe sphericity). The right part of Table 2 provides estimates of main and interaction effects from L-RM-ANOVA when estimated without the assumption of sphericity.
Currently, U-RM-ANOVA relies on post-hoc corrections to relax sphericity (e.g., Greenhouse-Geisser and Huynh-Feldt corrections), which computationally (and conceptually) differ from direct estimation of as performed via L-RM-ANOVA. Table 2 also provides p-values associated with these common post-hoc corrections.
Imposing the assumption of sphericity can increase power of hypothesis tests in U-RM-ANOVA. This is true for any type of assumption imposed to statistical models as fewer parameters need to be estimated if the assumptions is true. However, if the assumption, in fact, does not hold, an increased Type I error rate may arise (e.g., Haverkamp & Beauducel, 2017).
Simulation 2: Comparing U-RM-ANOVA and L-RM-ANOVA With and Without Sphericity
In the previous two sections, we compared hypothesis tests of U-RM-ANOVA and L-RM-ANOVA with and without the assumption of sphericity. Both approaches yield very similar F-values and p-values. It remains to be answered whether any of the approaches outperforms the others in terms of statistical properties. For instance, L-RM-ANOVA is estimated through maximum likelihood, which is known to have an inflated Type 1 error in small samples (e.g., Green & Babyak, 1997; Hu et al., 1992; Muthén & Kaplan, 1985; Raykov & Widaman, 1995). U-RM-ANOVA, in contrast, is said to have a power advantage but should also have an inflated Type 1 error when sphericity does not hold. In this section, we examine the statistical properties of the aforementioned approaches across several settings. We have two main hypotheses: (1) we expect L-RM-ANOVA to have an inflated Type 1 error when testing main and interaction effects in small samples as compared to RM-ANOVA, because it is estimated through maximum likelihood, which relies on asymptotic theory, and (2) we expect the models that assume sphericity to have an inflated Type 1 error when testing main and interaction effects, and larger bias of effect size estimates, when sphericity does not hold.
Method
We generated data for a 2 × 3 repeated measures design following the model in Figure 2. We investigated Type 1 error, power, bias, and root mean squared error (RMSE) of multivariate repeated measures ANOVA (RM-ANOVA), U-RM-ANOVA (with and without Greenhouse-Geisser and Huynh-Feldt corrections), and L-RM-ANOVA (with and without assuming sphericity) for the test of the main effect of Factor B, which has three levels (i.e., the effect consisted of two contrasts). We manipulated the degree of departure from sphericity W (i.e., we again set the variances to one and only manipulated the covariances), the sample size N, and the effect size. In particular, we again used four degrees of departure from sphericity (W = 0.4, 0.6, 0.8, 1.0) and eight sample sizes (N = 30, 40, 50, 60, 70, 80, 90, 100). We further used four effect sizes (0, 0.01, 0.06, 0.14). The effect size measure is a common one for repeated measures ANOVA (e.g., Keselman et al., 1998; Maxwell et al., 2008; Olejnik & Algina, 2000; Perugini et al., 2018; Steiger, 2004), where a value of 0 indicates that no effect is present and the other three values represent a small, medium, and large effect according to Cohen (1988). We imposed a particular effect size by setting the means of the two relevant contrast variables to a certain value. Since the means, variances, and covariance of the contrast variables all contribute to the effect size, the two means were chosen in a way that accounted for the degree of departure from sphericity (i.e., the two means differed across values of W even if the effect size was the same). An overview of all conditions is shown in Table 4. We used 1,000 replications for each of the aforementioned conditions. The simulation was performed using the statistical software R (R Core Team, 2021) in combination with the car package (Fox & Weisberg, 2019) for RM-ANOVA and U-RM-ANOVA, and Mplus (Muthén & Muthén, 2017) for the SEM based models. Finally, we chose an alpha level of .05 for all of the performed hypothesis tests.
Table 4
Conditions Used in the Two Simulation Studies
Simulation study | Effect size (condition) | W (condition) | Mean a (population) | Mean b (population) | Variance (population) | Covariance (population) |
---|---|---|---|---|---|---|
1, 2 | 0 | 0.4 | 0.00 | 0.00 | 1 | 0.77 |
1, 2 | 0 | 0.6 | 0.00 | 0.00 | 1 | 0.63 |
1, 2 | 0 | 0.8 | 0.00 | 0.00 | 1 | 0.45 |
1, 2 | 0 | 1.0 | 0.00 | 0.00 | 1 | 0 |
2 | 0.01 | 0.4 | 0.10 | 0.09 | 1 | 0.77 |
2 | 0.01 | 0.6 | 0.10 | 0.09 | 1 | 0.63 |
2 | 0.01 | 0.8 | 0.10 | 0.09 | 1 | 0.45 |
2 | 0.01 | 1.0 | 0.10 | 0.07 | 1 | 0 |
2 | 0.06 | 0.4 | 0.25 | 0.24 | 1 | 0.77 |
2 | 0.06 | 0.6 | 0.25 | 0.23 | 1 | 0.63 |
2 | 0.06 | 0.8 | 0.25 | 0.21 | 1 | 0.45 |
2 | 0.06 | 1.0 | 0.25 | 0.18 | 1 | 0 |
2 | 0.14 | 0.4 | 0.40 | 0.38 | 1 | 0.77 |
2 | 0.14 | 0.6 | 0.40 | 0.36 | 1 | 0.63 |
2 | 0.14 | 0.8 | 0.40 | 0.34 | 1 | 0.45 |
2 | 0.14 | 1.0 | 0.40 | 0.29 | 1 | 0 |
Note. The first simulation study used only the first four conditions. The second simulation study used all conditions. We used eight different sample sizes (N = 30, 40, 50, 60, 70, 80, 90, 100), which have been omitted to reduce the size of the table.
a Means of contrast variables that belong to an effect with one degree of freedom (i.e., intercept and main effect of A). b Means of contrast variables that belong to an effect with two degrees of freedom (i.e., main effect of B and interaction effect of A and B).
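The link between W and the covariances in Table 4 can be checked with a short sketch. It assumes that W is the population analogue of Mauchly's statistic for the two orthonormal contrasts of Factor B, that is, the determinant of their covariance matrix divided by the squared mean of their variances. This definition is an assumption made for illustration, and the function names are hypothetical.

```python
import math


def mauchly_w_two_contrasts(var1, var2, cov):
    """Population analogue of Mauchly's W for two orthonormal contrasts.

    W = det(S) / (trace(S) / p) ** p with p = 2 contrast variables.
    """
    determinant = var1 * var2 - cov ** 2
    mean_variance = (var1 + var2) / 2.0
    return determinant / mean_variance ** 2


def covariance_for_w(w):
    """Covariance that yields a given W when both variances equal one."""
    return math.sqrt(1.0 - w)


for w in (0.4, 0.6, 0.8, 1.0):
    cov = covariance_for_w(w)
    # Columns: target W, implied covariance, W recomputed from that covariance.
    print(w, round(cov, 2), round(mauchly_w_two_contrasts(1.0, 1.0, cov), 2))
```

Under this assumption, the implied covariances round to 0.77, 0.63, 0.45, and 0, which agrees with the last column of Table 4.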
Results and Discussion
Power
Hypothesis tests were based on an F-test and the sums of squares for RM-ANOVA models, and on a likelihood ratio test for SEMs. The Greenhouse-Geisser and the Huynh-Feldt corrections showed virtually identical results, which is why we summarize both corrections under U-RM-ANOVA + GG/HF. We thus compared the statistical power and Type 1 error of five models, namely L-RM-ANOVA, L-RM-ANOVA + sphericity, RM-ANOVA, U-RM-ANOVA, and U-RM-ANOVA + GG/HF. The results are shown in Figure 4, where the first row shows the Type 1 error (effect size of zero) and the other rows show power (nonzero effect sizes). As hypothesized, the SEM based models showed a slightly inflated Type 1 error of up to 7.3% in the smallest sample size condition (N = 30) when the simulated effect size was zero (as shown in the first row of the figure). The Type 1 error was further inflated, reaching up to 8.5%, for all models that mistakenly assumed sphericity when the assumption did not hold (first row, the three right-hand tiles). The Type 1 error decreased with increasing sample size for the SEM based models, but seemed to be constant across sample sizes for univariate models that incorrectly assume sphericity. Furthermore, power increased with sample size and effect size for all models. The multivariate approaches (RM-ANOVA and L-RM-ANOVA) were unaffected by departures from sphericity in terms of power. The univariate models (U-RM-ANOVA and L-RM-ANOVA + sphericity) showed higher power by up to 15.8% as compared to multivariate models, particularly when sphericity was strongly violated (W = 0.4) and the sample size was small (N = 30). We argue that this presumed power “advantage” comes at the cost of an inflated Type 1 error and should not be trusted. The corrected univariate model (U-RM-ANOVA + GG/HF) also shows a slight power advantage (by 6.3% with W = 0.4 and N = 30), which we think can be trusted because the model does not show Type 1 error inflation.
Figure 4
Power and Type 1 Error for RM-ANOVA, U-RM-ANOVA, L-RM-ANOVA and L-RM-ANOVA Assuming Sphericity as a Function of Effect Size, Sample Size N, and Degree of Departure From Sphericity (Mauchly’s W)
Note. The first grayed row (effect size of zero) shows the Type 1 error. The other rows show the power.
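Empirical Type 1 error and power in a simulation of this kind are rejection rates across replications. The following sketch shows that computation together with the Monte Carlo standard error of a rejection rate. It assumes an alpha level of .05 and 1,000 replications per condition; the function names and the short list of p-values are hypothetical.

```python
import math


def rejection_rate(p_values, alpha=0.05):
    """Proportion of replications in which the null hypothesis is rejected.

    This estimates the Type 1 error when the true effect is zero and the
    power when the true effect is nonzero.
    """
    rejections = sum(1 for p in p_values if p < alpha)
    return rejections / len(p_values)


def monte_carlo_se(rate, n_replications):
    """Standard error of a rejection rate estimated from independent replications."""
    return math.sqrt(rate * (1.0 - rate) / n_replications)


# Hypothetical p-values from four replications.
print(rejection_rate([0.01, 0.20, 0.04, 0.50]))

# Sampling uncertainty of a true 5% rejection rate with 1,000 replications.
se_nominal = monte_carlo_se(0.05, 1000)
print(round(se_nominal, 4))
```

With 1,000 replications the standard error around a nominal 5% rate is roughly 0.7 percentage points, which gives a sense of scale for the deviations from 5% reported above.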
As mentioned at the beginning of the previous paragraph, hypotheses are tested by different statistical tests in RM-ANOVA models and SEMs. Those tests can differ in terms of power and Type 1 error rate. However, it is also possible to derive the sums of squares and an F-test based on the point estimates of means, variances, and covariances from SEM (see Calculating Sums of Squares, Mean Squares, and F-Ratios Using L-RM-ANOVA). The resulting test would have the same statistical properties as the F-test of RM-ANOVA. We limited the comparison, however, to the classical tests (i.e., F-tests for RM-ANOVA and the likelihood ratio test for SEM) because they are most common in the two frameworks.
Bias
Relative and absolute bias of the estimated effect size were identical across the univariate models (L-RM-ANOVA + sphericity and U-RM-ANOVA), and also across the multivariate models (L-RM-ANOVA and RM-ANOVA). This pattern is not surprising because the sums of squares (which effect size estimates rely on) can be derived exactly in SEM (see Calculating Sums of Squares, Mean Squares, and F-Ratios Using L-RM-ANOVA). The results are shown in Figure 5, where the first row shows the absolute bias and the other rows show the relative bias.
Figure 5
Bias for RM-ANOVA, U-RM-ANOVA, L-RM-ANOVA and L-RM-ANOVA Assuming Sphericity as a Function of Effect Size, Sample Size N, and Degree of Departure From Sphericity (Mauchly’s W)
Note. The first grayed row (effect size of zero) shows the absolute bias. The other rows show the relative bias.
In general, relative and absolute bias decreased with increasing sample size, and relative bias decreased with effect size. Violations of sphericity did not seem to affect relative bias. Univariate models had a smaller bias as compared to multivariate models. Contrary to our expectation, departures from sphericity did not seem to affect bias for either the multivariate or the univariate models.
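Absolute and relative bias of an effect size estimate can be sketched as follows. The definitions used here (mean estimate minus true value, and that difference divided by the true value) are the conventional ones and are assumed rather than taken from the article; the estimates are hypothetical.

```python
def absolute_bias(estimates, true_value):
    """Mean of the estimates across replications minus the true value."""
    return sum(estimates) / len(estimates) - true_value


def relative_bias(estimates, true_value):
    """Absolute bias expressed as a proportion of the true value."""
    if true_value == 0:
        # The ratio is undefined for a true effect of zero.
        raise ValueError("relative bias is undefined when the true value is zero")
    return absolute_bias(estimates, true_value) / true_value


# Hypothetical effect size estimates from four replications, true value 0.06.
estimates = [0.05, 0.07, 0.08, 0.06]
abs_bias = absolute_bias(estimates, 0.06)
rel_bias = relative_bias(estimates, 0.06)
print(round(abs_bias, 4), round(rel_bias, 4))
```

Because the ratio is undefined for a true effect of zero, only absolute bias can be reported in that condition, which is consistent with the first row of Figure 5 showing absolute rather than relative bias.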
Root Mean Squared Error
As for bias, the relative and absolute RMSE of the estimated effect size were identical across the univariate models (L-RM-ANOVA + sphericity and U-RM-ANOVA), and also across the multivariate models (L-RM-ANOVA and RM-ANOVA). The results are shown in Figure 6, where the first row shows the absolute RMSE and the other rows show the relative RMSE.
Figure 6
Root Mean Squared Error (RMSE) for RM-ANOVA, U-RM-ANOVA, L-RM-ANOVA and L-RM-ANOVA Assuming Sphericity as a Function of Effect Size, Sample Size N, and Degree of Departure From Sphericity (Mauchly’s W)
Note. The first grayed row (effect size of zero) shows the absolute RMSE. The other rows show the relative RMSE.
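The RMSE combines bias and sampling variability of an estimator. A minimal sketch, assuming that the relative RMSE is the RMSE divided by the true value (an assumption, since the article does not spell out the formula) and using hypothetical estimates:

```python
import math


def rmse(estimates, true_value):
    """Root mean squared error of the estimates around the true value."""
    squared_errors = [(e - true_value) ** 2 for e in estimates]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_rmse(estimates, true_value):
    """RMSE as a proportion of the true value (assumed definition)."""
    return rmse(estimates, true_value) / true_value


# Hypothetical effect size estimates from four replications, true value 0.06.
estimates = [0.05, 0.07, 0.08, 0.06]
true_value = 0.06
mean_estimate = sum(estimates) / len(estimates)

# Decomposition check: squared RMSE = squared bias + variance of the estimates.
squared_bias = (mean_estimate - true_value) ** 2
variance = sum((e - mean_estimate) ** 2 for e in estimates) / len(estimates)
print(round(rmse(estimates, true_value) ** 2, 6), round(squared_bias + variance, 6))
```

The decomposition shows why an estimator with small bias can still have a large RMSE when its estimates vary strongly across replications.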
Testing Sphericity, Main Effects, and Interaction Effects for L-RM-ANOVAs With Measurement Models
Traditional (U-)RM-ANOVA assumes that the outcome variable can be observed across experimental conditions. The outcome of interest, however, oftentimes includes questionnaire items, test scores, reaction times, and accuracies that serve as indicators to measure an underlying psychological construct, such as cognitive processes, attention, traits, or attitudes. In many cases, underlying constructs cannot be measured directly and indicators suffer from measurement error. Latent variable models can be used to explicitly model measurement error. This section describes how sphericity, main effects, and interaction effects may be tested in extensions of L-RM-ANOVA that include measurement models (conceptually similar to second order growth curves; see Langenberg et al., 2020). We will use the two manifest measures mean gaze duration and mean total viewing duration to measure the latent construct “reading ability” and re-analyze the data.
Figure 7 extends the guiding example to include two manifest measures (mean gaze duration and mean total viewing duration, see Figure 1) at each of the six measurement conditions, forming six latent common factors that are transformed into six latent contrasts. Rectangles represent manifest variables and circles represent latent variables. Following standards for longitudinal models with multiple indicators per measurement occasion (e.g., Newsom, 2015, p. 42), residual covariances are estimated between the same manifest indicator across the six measurement conditions (depicted as gray double-headed arrows). The latent common factors explain common variance in the manifest variables in each of the conditions. However, residual covariances between manifest variables across conditions may occur. These correlations can be accounted for by including residual covariances between mean gaze duration across conditions and between mean total viewing duration across conditions, respectively.
Figure 7
Path Diagram of the SEM Implementing a 2 × 3 (Sentence Type × Grade) Repeated Measures Design Using an Orthonormal Contrast Matrix and a Measurement Model With the Manifest Variables Mean Gaze Duration and Mean Total Viewing Duration
Note. Rectangles represent manifest variables and circles represent latent variables. Intercepts of mean gaze duration in each condition are set to zero. Intercepts of mean total viewing duration are estimated but constrained to be equal. Residual variances of the manifest variables y are freely estimated. Residual covariances among the y of the same type of sentence and variable are estimated but constrained to be equal. Intercepts and (co)variances of the contrast variables are freely estimated.
Oftentimes, such designs are inappropriately analyzed by averaging across the two indicators, so that traditional methods can be used (e.g., RM-ANOVA). Averaging across indicators, however, can lead to ignoring other random factors, such as stimuli, and can introduce substantial bias (Judd et al., 2017). As a solution, linear mixed models (LMM; Fitzmaurice et al., 2011; Laird & Ware, 1982) are able to include all of the measures and to estimate the model. LMMs, however, assume that the two indicators are parallel measurements. A parallel measurement model assumes that the loadings equal one, the intercepts equal zero, and the residual variances are equal. This condition, however, does not necessarily hold in the given example, nor in many other examples from psychological research, such as test scores or questionnaire items. In SEM, in contrast, the assumption can be relaxed and the more general congeneric measurement model can be used (as was used in Figure 7).
Inclusion of measurement models across repeated measures (e.g., second-order growth curves) further necessitates consideration of (and tests for) measurement invariance (e.g., Newsom, 2015, Ch. 2). However, given this article’s focus on sphericity and main/interaction effects, the didactic nature of this section (i.e., we do not aim to explicitly test psychological theories), and available literature on measurement invariance (Newsom, 2015; Pitts et al., 1996; Widaman et al., 2010), this article assumes that readers have knowledge of measurement invariance and does not discuss the topic in detail (see Langenberg et al., 2020, for measurement invariance in L-RM-ANOVA). Instead, we note that data in the guiding example adhered to a model with strong measurement invariance (CFI = .946, TLI = .914, RMSEA = .115, 90% CI RMSEA = [.099, .132]), which facilitates comparisons across the means/intercepts of the six latent common factors. That is, loadings are constrained to be equal across measurement conditions (i.e., the first loading is fixed to 1 and the second loading is constrained to be equal across conditions), and so are the intercepts of the manifest variables (i.e., the intercept of the first indicator is fixed to 0 and the intercept of the second indicator is constrained to be equal across conditions).
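The difference between a parallel and a congeneric measurement model can be illustrated by the means and covariances each model implies for two indicators of one latent factor. The sketch below uses the standard single-factor expressions (indicator mean = intercept + loading × factor mean; covariance = loading × loading × factor variance, plus the residual variance on the diagonal). All parameter values are hypothetical and the function name is illustrative.

```python
def implied_moments(loadings, intercepts, residual_variances,
                    factor_mean, factor_variance):
    """Model-implied means and covariance matrix of the indicators of one factor."""
    means = [nu + lam * factor_mean for nu, lam in zip(intercepts, loadings)]
    covariance = [
        [
            lam_i * lam_j * factor_variance
            + (residual_variances[i] if i == j else 0.0)
            for j, lam_j in enumerate(loadings)
        ]
        for i, lam_i in enumerate(loadings)
    ]
    return means, covariance


# Parallel model: loadings of one, intercepts of zero, equal residual variances.
parallel_means, parallel_cov = implied_moments(
    [1.0, 1.0], [0.0, 0.0], [0.5, 0.5], factor_mean=2.0, factor_variance=1.0)

# Congeneric model: second loading, intercept, and residual variance are free.
congeneric_means, congeneric_cov = implied_moments(
    [1.0, 1.4], [0.0, 0.3], [0.5, 0.8], factor_mean=2.0, factor_variance=1.0)

print(parallel_means, parallel_cov)
print(congeneric_means, congeneric_cov)
```

Under the parallel model both indicators must share the same mean and variance, whereas the congeneric model allows indicators measured on different scales, which is the more plausible case for two different viewing-time measures.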
As is standard for SEM estimation more generally (and departing from prior analyses that only used complete data), this section analyzes all available data (N = 268) via full information maximum likelihood (FIML). Analogous to measurement invariance, we assume that readers have sufficient knowledge of FIML and its application in SEM to account for missing data (Enders, 2013).
If strong measurement invariance holds, then sphericity of each main/interaction effect may be tested in the same manner as in L-RM-ANOVA without measurement models. Sphericity may still be tested via χ²-difference tests across models that constrain/relax the appropriate variances and covariances of the latent contrasts (see the section Testing Sphericity Using L-RM-ANOVA). In the guiding example, sphericity for the main effect of B fails (df = 2, p < .001), and sphericity for the interaction effect also fails (df = 2, p < .001).
Main and interaction effects may still be tested via χ²-difference tests across models that constrain/relax the appropriate means of the latent contrasts. These comparisons may be done under the assumption of sphericity (i.e., both models in the comparison include the constraints that conform to sphericity), or not (i.e., both models in the comparison fully estimate the covariance matrix of the latent contrasts). Although this article does not intend to compare SEM to LMM, we would like to point out that imposing sphericity on the variances of latent contrast variables in an SEM is conceptually similar to constraining the covariance matrix of the random effects in an LMM (for similarities between SEM and LMM in the context of growth curves, see, e.g., Rovine & Molenaar, 1998; see also Newsom, 2002). When assuming sphericity, there is evidence for a main effect of A (df = 1, p < .001), a main effect of B (df = 2, p < .001), and the interaction (df = 2, p < .001). When relaxing sphericity, there is again evidence for a main effect of A (df = 1, p < .001), a main effect of B (df = 2, p < .001), and the interaction (df = 2, p < .001).
Conclusions and Future Directions
This article identified and exemplified the direct connection between U-RM-ANOVA and SEM: L-RM-ANOVA. More specifically, latent contrasts may be formed by using the inverse of an orthogonal contrast matrix as a factor loading matrix, sphericity corresponds to specific constraints on the variances and covariances of those latent contrasts (i.e., specific elements of their covariance matrix), and tests of main/interaction effects correspond to the significance of latent contrast means. As shown in the examples, sphericity may be imposed, relaxed, and tested in the SEM context via χ²-difference tests, and results mirror those from Mauchly’s test (see the section Testing Sphericity Using L-RM-ANOVA). Although the χ²-difference tests of SEM do not exactly match the F-tests from U-RM-ANOVA, they do test the same hypotheses, and sums of squares and exact F-values may be reproduced in L-RM-ANOVA via an orthonormal contrast matrix. Finally, taking full advantage of the SEM framework, the L-RM-ANOVA approach can include measurement models and accommodate missing data via FIML when testing for sphericity, main effects, and interaction effects. Therefore, this article serves to fill the gap of how U-RM-ANOVA is a special case of SEM.
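The role of the contrast matrix can be made concrete with a small sketch. It builds an orthonormal contrast matrix for a 2 × 3 design as a Kronecker product and uses its inverse (which, for an orthonormal matrix, equals its transpose) as the loading matrix. The specific contrasts chosen here (a difference contrast for Factor A and Helmert-type contrasts for Factor B) and the ordering of the six conditions are assumptions for illustration and need not match those used in the guiding example.

```python
import math


def kronecker(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def transpose(m):
    return [list(column) for column in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


s2, s3, s6 = math.sqrt(2), math.sqrt(3), math.sqrt(6)

# Orthonormal contrasts per factor: the first row is the intercept (average).
contrasts_a = [[1 / s2, 1 / s2],
               [1 / s2, -1 / s2]]
contrasts_b = [[1 / s3, 1 / s3, 1 / s3],
               [1 / s2, -1 / s2, 0.0],
               [1 / s6, 1 / s6, -2 / s6]]

# Rows: intercept, B (2 rows), A, A x B (2 rows); columns: A1B1, ..., A2B3.
contrast_matrix = kronecker(contrasts_a, contrasts_b)

# For an orthonormal matrix the inverse is the transpose, so the loading
# matrix that maps latent contrasts back to the six conditions is C'.
loading_matrix = transpose(contrast_matrix)
identity_check = matmul(contrast_matrix, loading_matrix)
print([[round(v, 10) for v in row] for row in identity_check])
```

Because the product of the contrast matrix and the loading matrix is the identity, the six condition variables are reproduced exactly from the six latent contrasts, and each main or interaction effect corresponds to a block of rows in the contrast matrix.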
Two simulation studies were performed: (1) examining the statistical properties of the SEM based sphericity test, and (2) comparing properties of (U-)RM-ANOVA and L-RM-ANOVA with and without sphericity. The first simulation study shows that Mauchly’s test and the SEM based test yield virtually identical power and Type 1 error rates. The second simulation study shows that RM-ANOVA and L-RM-ANOVA have similar statistical properties. However, L-RM-ANOVA has a slightly inflated Type 1 error of about 7% for a sample size of N = 30, which approaches the desired 5% for larger sample sizes. The univariate approaches also show an inflated Type 1 error of up to 8% when sphericity is violated. Greenhouse-Geisser and Huynh-Feldt corrected hypothesis tests from U-RM-ANOVA, furthermore, perform best in terms of power. Although the power advantage is rather small, power was larger than for multivariate RM-ANOVA, while the Type 1 error was not inflated as in the case of the uncorrected U-RM-ANOVA.
This article also helped illuminate the definition of sphericity. In contrast to the colloquial definition (i.e., equal variances across all pairwise differences, which only holds for the within-subjects design with one factor), this article emphasized the original definition of sphericity via matrix algebra, showed how that definition may be generalized to within-subjects designs with two or more factors, and illustrated how that definition may be implemented/tested in L-RM-ANOVA both with and without measurement models. It is our hope that researchers who are familiar with SEM, but who either lack a clear understanding of sphericity or adhere to the colloquial definition, can use this article to gain a clearer understanding of sphericity.
Although L-RM-ANOVA can be extended to mixed designs (i.e., those with both within- and between-subjects factors), we narrowed the scope of the article to within-subjects designs to emphasize identification and tests of sphericity, and because L-RM-ANOVA with mixed designs has been described elsewhere (see Langenberg et al., 2020, for examples of mixed designs). Nevertheless, we briefly mention the two extensions for incorporating between-subjects factors into L-RM-ANOVA. First, coded versions of between-subjects factors (e.g., dummy codes, effect codes) may be used to predict the latent contrasts. Second, a multiple group SEM can estimate the L-RM-ANOVA model for each intersection of the between-subjects factors. The first approach must assume that the covariance matrix of the latent contrasts is equal across the intersections of the between-subjects factors, whereas the multiple group approach can test/relax that assumption.
We discuss two future directions that build on the knowledge generated in this article. First, the L-RM-ANOVA approach can be generalized to non-normal and/or non-continuous dependent variables (e.g., log-normal, dichotomous, or ordinal measures). Researchers implementing experimental designs likely measure such variables. SEM has extensions available for non-normal dependent variables (e.g., Finney & DiStefano, 2013). For instance, analyzing error rates requires a binomial distribution, and reaction times often follow a skewed distribution (e.g., log-normal). Stated differently, L-RM-ANOVA can capitalize on SEM’s extensions for non-normal and non-continuous dependent variables, which researchers will likely find useful. Second, the L-RM-ANOVA approach can be compared to linear mixed models. As was described in the section Testing Sphericity, Main Effects, and Interaction Effects for L-RM-ANOVAs With Measurement Models, L-RM-ANOVA allows for relaxing the assumption of a parallel measurement model, which, in contrast, is essential for LMM. It would be interesting to see how both approaches perform if this assumption is violated. Following the first future direction, L-RM-ANOVA and LMMs could also be compared for non-normal outcomes.
In conclusion, this article identified and demonstrated the missing link that connects U-RM-ANOVA to SEM (via L-RM-ANOVA), and provided researchers a clear definition of sphericity. We hope that this article enables applied researchers to use SEM in practice (especially in cases that warrant measurement models), and motivates quantitative researchers to continue building on the L-RM-ANOVA framework.