Person-Centered Data Analysis With Covariates and the R-Package confreq

Configural Frequency Analysis (CFA) is a useful statistical method for the analysis of multiway contingency tables and an appropriate tool for person-oriented or person-centered methods. In complex contingency tables, patterns or configurations are analyzed by comparing observed cell frequencies with expected frequencies. Significant differences between observed and expected frequencies lead to the emergence of Types and Antitypes. Types are patterns or configurations that are observed significantly more often than expected; Antitypes are configurations that are observed significantly less often than expected. The R-package confreq is easy-to-use software for conducting CFAs; another useful shareware program for running CFAs was developed by Alexander von Eye. Here, CFA is presented based on the log-linear modeling approach. CFA may be used together with interval-level variables, which can be added as covariates to the design matrix. In this article, a real data example and the use of confreq are presented. In sum, the use of a covariate may bring the estimated cell frequencies closer to the observed cell frequencies. In those cases, the number of Types or Antitypes may decrease. However, in rare cases, the Type-Antitype pattern can change, with new Types or Antitypes emerging.


Theoretical Background
In the field of developmental psychology, person-oriented research is mainly represented by David Magnusson and Lars Bergman from the University of Stockholm in Sweden (see also Stemmler & Heine, 2017). They call this approach holistic and interactionistic, where development is investigated in functionally organized systems (Bergman & Magnusson, 1997). Systems or individuals are embedded in and strongly connected with their context. The individual is seen as a self-organizing unit, which functions and develops as an irreducible whole. This wholeness evolves out of the inter-dynamics between the different structures and elements of the system. The respective structures and processes of the individual encompass psychological constructs like behavior, perceptions, goals, plans of action, social norms, motives, values, and biological functioning in the brain and physiological system. They obtain their roles and meanings as part of the interaction between the structures of the systems within the whole individual. "A certain element derives its significance not from its structure but from its functional role in the system of which it forms a part" (Magnusson & Mahoney, 2001, p. 5). Development cannot be understood by studying single factors in isolation from other simultaneously operating factors. The individual and the environment influence each other. The individual is seen as an active agent or producer of his or her own development (Lerner & Busch-Rossnagel, 1981; Silbereisen & Noack, 2006). Therefore, the human being is seen as a complex dynamic system, which ideally can be understood only under the holistic interactionistic approach (Bergman, Magnusson, & El Khouri, 2003; Magnusson & Allen, 1983).
Here we give the reader a brief overview of the interdisciplinary use of CFA (see also Stemmler, 2020):
1. Researchers from the field of hydrobiology (Melcher, Lautsch, & Schmutz, 2012) were interested in spawning habitats of fish, because a sufficient fish stock is important for the ecological system of a river. They found that many European fish prefer the following significant configuration (Type): a shaded habitat with a fine and coarse substrate, depending on high flow velocity.
2. Ilker and Ercan (2018) studied the causes of the death of cattle calves with the help of CFA. They recorded the characteristics of the barn system (separation of mothers and calves or joint rearing), type of disease (intestinal disease, respiratory disease, trauma), vaccination status (vaccinated versus unvaccinated) and gender. The Turkish researchers from veterinary medicine found a significant configuration (Type): Cattle calves died more often than expected from intestinal disease if they were not vaccinated and if the mothers were kept together with the calves in a barn; the gender of the calf was irrelevant.
3. Lannegrand-Willems et al. (2018) used CFA in their research on developmental psychology. Adolescence and emerging adulthood were seen as periods in life when individuals question and define their place in society and form their identity. This French research group studied the importance of different forms of civic engagement among late adolescents and emerging adults and found a significant configuration (Type): Students who lacked identity formation did not participate in any political or civic engagement, did not vote, and had no sense of belonging to any social group; at the same time, they were untroubled by the absence of personal commitment.
4. Brenner et al. (2020) studied race and ethnicity considerations in traumatic brain injuries based on a database of Pennsylvania trauma injuries (i.e., Pennsylvania Trauma Outcome Study [PTOS]). Depending on the functional status at discharge, they used CFA to investigate the discharge destinations. Among several Types, the researchers found that for brain-injured patients with a moderate to severe dysfunction at discharge, White individuals were overall more likely to receive extended care than individuals in other racial groups, who were likely to be sent home.
CFA belongs to the person-oriented analytic approach for the analysis of frequencies in multi-way contingency tables (cf. Stemmler, 2020). In this manuscript, the use of the R-package confreq (Heine et al., 2020) is demonstrated. First, the results of a first order CFA are presented and then covariates are added to the design matrix in two consecutive steps. The effects of covariates on the detection of Types or Antitypes are described. The data used are from a study of juveniles called Chances and Risks in the Life-Course (CURL; Reinecke et al., 2013).

Methodological Background
On the statistical level and for the use of CFA, individuals, animals or objects are grouped in cross-tabulations into disjunct categories based on their respective patterns or configurations (Stemmler & Heine, 2017). Patterns (configurations) with frequencies ($o_{ijk}$) that occur significantly more often than their corresponding expected cell frequencies ($e_{ijk}$) constitute CFA Types. Configurations occurring significantly less often than predicted under the null hypothesis constitute CFA Antitypes. Log-linear modeling (LLM) and CFA are closely related (von Eye & Mun, 2013). LLM is used to identify the structure among the categorical variables. It parameterizes the distribution of cell frequencies, or, to put it differently, the logarithms of cell frequencies, in terms of main effects and interactions. Each CFA base model can be expressed as an LLM; however, there are LLMs that cannot be used as a CFA base model. In log-linear modeling the expected frequencies are estimated by using the Generalized Linear Model (GLM). The General Linear Model is a special case of the Generalized Linear Model. The GLM is:
$$f(y) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \tag{1}$$
The function $f(y)$ is called the link function. The link function describes the transformation of the dependent variable. In matrix algebra, the parameters are calculated according to the GLM as
$$f(Y) = X\beta \tag{2}$$
where $Y$ is a column vector including the dependent variable, $X$ is the matrix of independent variables, and $\beta$ is the vector of parameters. In LLM the predictor model can be written as:
$$\log e = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n \tag{3}$$
with $\log(e)$ as the logarithm of the expected frequencies. In LLM one uses ln, the natural logarithm with base e (i.e., Euler's number = 2.7182…). If we replace the parameters $\beta$ by $\lambda$ we obtain the equation for a log-linear model:
$$\log e_{ijk} = \lambda_0 + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{C} + \lambda_{ij}^{AB} + \lambda_{ik}^{AC} + \lambda_{jk}^{BC} + \lambda_{ijk}^{ABC} \tag{4}$$
in the case of three variables A, B and C.
The relation of the parameters is the same as in the GLM (see Equation 2):
$$\log e = X\lambda \tag{5}$$
where $\lambda$ is the parameter vector and $\log(e)$ is the vector of expected model frequencies; these are the frequencies that are consistent with the log-based model. $X$ is the design matrix and may contain effect-coded main effects, interaction terms as well as covariates plus the constant (intercept). In addition to effect coding, dummy coding and contrast coding are possible. The design matrix $X$ has as many rows as there are cells or configurations. The first $\lambda$ weight is always the constant, coded with ones. $\lambda$ comprises the weights of the independent variables and is a one-column vector with as many entries as $X$ has columns. The basis of CFA is the analysis of frequencies in multi-way contingency tables. Each individual case is cross-tabulated into disjunct categories based on his or her respective pattern or configuration. The underlying logic is the comparison of observed frequencies $f(o)$ with expected frequencies $f(e)$. To this end, a global chi-square, a goodness-of-fit statistic, is calculated (for didactic reasons, the following formula is presented for three variables but can easily be extended to any number of variables):
$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} \frac{(o_{ijk} - e_{ijk})^2}{e_{ijk}} \tag{6}$$
where
I = number of categories of the first variable, ranging from i = 1, 2, …, I
J = number of categories of the second variable, ranging from j = 1, 2, …, J
K = number of categories of the third variable, ranging from k = 1, 2, …, K
$o_{ijk}$ = the observed frequencies of pattern ijk
$e_{ijk}$ = the expected frequencies of pattern ijk
The general formula for the degrees of freedom for a contingency table with main effects is:
$$df = T - \sum_{d=1}^{D} (v_d - 1) - 1 \tag{7}$$
with $T$ representing the number of cells or configurations, $d = 1, \dots, D$ indexing the variables (dimensions), and $v_d$ the number of categories of variable $d$.
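As a brief worked illustration of the degrees-of-freedom rule, consider a table of three dichotomous variables, which is the layout analyzed later in this article. The symbols follow the definitions above ($T$ cells, $D$ variables, $v_d$ categories per variable); the remark on covariates assumes that each covariate contributes exactly one additional column to the design matrix.

```latex
% Three dichotomous variables: T = 2 * 2 * 2 = 8 cells, D = 3, v_d = 2
\begin{align*}
df &= T - \sum_{d=1}^{D} (v_d - 1) - 1 \\
   &= 8 - \bigl[(2-1) + (2-1) + (2-1)\bigr] - 1 \\
   &= 8 - 3 - 1 = 4
\end{align*}
% Each covariate column uses one further degree of freedom:
% one covariate  -> df = 4 - 1 = 3
% two covariates -> df = 4 - 2 = 2
```

These values agree with the degrees of freedom reported for the three models in the results sections (df = 4, 3 and 2).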
An important alternative goodness-of-fit statistic to Pearson's chi-square is the Likelihood Ratio chi-square (LR):
$$LR = 2 \sum_{i=1}^{I} \sum_{j=1}^{J} \sum_{k=1}^{K} o_{ijk} \ln\!\left(\frac{o_{ijk}}{e_{ijk}}\right) \tag{8}$$
The global chi-square tests the following statistical hypotheses ($H_0$ and $H_1$). Again, for didactic reasons, the following formulas are presented for three variables but may easily be extended to any number of variables:
$$H_0: \pi_{ijk} = \pi_{i..}\,\pi_{.j.}\,\pi_{..k} \tag{9}$$
$$H_1: \pi_{ijk} \neq \pi_{i..}\,\pi_{.j.}\,\pi_{..k} \tag{10}$$
where $\pi_{ijk}$ denotes the cell probabilities at the population level, and $\pi_{i..}$, $\pi_{.j.}$, $\pi_{..k}$ denote the marginal probabilities at the population level.
In semantic terms, the null (H 0 ) and alternative hypothesis (H 1 ) are expressed as follows: H 0 : There are no significant (local) associations between the variables involved or the variables are independent of each other. H 1 : There are significant (local) associations between the variables involved or the variables are not independent of each other.
The alternative hypothesis also includes higher-order associations. In non-hierarchical log-linear models, lower-order associations are omitted (cf. Rindskopf, 1990). From the perspective of log-linear modeling, leaving out the lower-order association effect parameters can be problematic, because the effects coded in the design matrix may no longer be independent of each other. This, in turn, makes the interpretation of the effect parameters more complex (cf. von Eye & Mun, 2013).
The expected frequencies are calculated according to the assumption of independence:
$$e_{ijk} = \frac{o_{i..}\; o_{.j.}\; o_{..k}}{N^2} \tag{11}$$
where $o_{i..}$, $o_{.j.}$ and $o_{..k}$ are the observed marginal frequencies and $N$ is the total sample size. A CFA that is based on the assumption of independence is called first order CFA. In addition, we differentiate between a local and a global chi-square value. A significant global chi-square, which is a goodness-of-fit statistic, is a necessary but not a sufficient condition for a significant local chi-square. It can be that the null hypothesis is rejected, but cell-wise model-data discrepancies may not be extreme enough to result in Types or Antitypes (von Eye & Wiedermann, in press). A significant local chi-square indicates a local association between variables; it is calculated by
$$\chi^2_{ijk} = \frac{(o_{ijk} - e_{ijk})^2}{e_{ijk}} \tag{12}$$
with 1 degree of freedom. Significant local chi-square values represent Types or Antitypes. Another valuable statistic in the search for Types or Antitypes is the chi-square approximation to the z-test:
$$z_{ijk} = \frac{o_{ijk} - e_{ijk}}{\sqrt{e_{ijk}}} \tag{13}$$
CFA also allows the use of continuous variables as covariates. "The use of covariates typically carries the estimated cell frequencies closer to the observed cell frequencies, because more information is used in the estimation procedure (von Eye & Niedermeier, 1999)" (von Eye, 2002, p. 309). Note that CFA tests are never fully independent (von Eye, Mair, & Mun, 2010), so alpha protection is required (e.g., Bonferroni's adjustment or Holm's procedure).
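The steps described above (expected frequencies under independence, global Pearson and LR chi-square, local tests with alpha protection) can be sketched in base R without any additional package. This is a minimal illustration, not the confreq implementation: the cell counts are invented, and the variable names A, B and C as well as all object names are assumptions made for this example.

```r
# Hypothetical 2 x 2 x 2 table; the counts are invented for illustration
tab <- expand.grid(A = c("+", "-"), B = c("+", "-"), C = c("+", "-"))
tab$Freq <- c(40, 10, 15, 20, 12, 18, 25, 60)
N <- sum(tab$Freq)

# Observed marginal frequencies, repeated for every cell of the table
oA <- ave(tab$Freq, tab$A, FUN = sum)
oB <- ave(tab$Freq, tab$B, FUN = sum)
oC <- ave(tab$Freq, tab$C, FUN = sum)

# Expected frequencies under the assumption of independence
tab$Exp <- oA * oB * oC / N^2

# Local statistics: chi-square component and z value per configuration
chi2_local <- (tab$Freq - tab$Exp)^2 / tab$Exp
z_local    <- (tab$Freq - tab$Exp) / sqrt(tab$Exp)

# Global goodness-of-fit statistics and degrees of freedom
chi2_global <- sum(chi2_local)
LR          <- 2 * sum(tab$Freq * log(tab$Freq / tab$Exp))
n_levels    <- sapply(tab[c("A", "B", "C")], nlevels)
df          <- nrow(tab) - sum(n_levels - 1) - 1
p_global    <- pchisq(chi2_global, df = df, lower.tail = FALSE)

# Local tests (1 df each) against a Bonferroni-adjusted alpha
p_local    <- pchisq(chi2_local, df = 1, lower.tail = FALSE)
alpha_bonf <- 0.05 / nrow(tab)
tab$Result <- ifelse(p_local < alpha_bonf,
                     ifelse(z_local > 0, "Type", "Antitype"), ".")
tab
```

The same expected frequencies result from fitting a main-effects Poisson log-linear model, for example with `glm(Freq ~ A + B + C, family = poisson, data = tab)`, which is the link between first order CFA and LLM described in this section.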

Method

Study Subjects
The data for the present paper come from the project "CURL" (see Reinecke et al., 2013 for an overview). They include 1248 students from 5th grade, 189 (15.1%) of whom had reported at least one crime in the last year. The longitudinal data for $t_1$ to $t_2$ (time gap: two years) included 775 juveniles with complete data with regard to delinquency. Of the 189 offenders at $t_1$, 114 (ca. 60%) remained in the longitudinal data file, and about one half (48.2%) reported having committed another crime at $t_2$.

Study Variables
The selection of variables, here possible risk factors, for the following analyses was based on a publication with the title "Risk factors for the development of antisocial behavior in childhood and youth" (German original: Risikofaktoren für die Entwicklung dissozialen Verhaltens in der Kindheit und Jugend; Stemmler et al., 2018). That chapter, which included an introduction to the concept of risk factors and their characteristics, also analyzed data from the project "CURL". For the following analyses, all risk factors were included that showed a significant bivariate correlation with delinquent behavior. Delinquent behavior encompassed behavior that is forbidden under penal law; this includes property crime, vandalism and violence. An "offender" was defined as a person who reported having committed at least one crime in the past year. A "non-offender" was a study participant who had not committed any crime in the past year.

Results of the First Order CFA
In 5th grade, Antisocial Attitudes together with Delinquent Peers were significantly associated with Offender Status two years later. The results of the first order CFA can be found in Table 1. Both goodness-of-fit statistics suggested a poor fit: LR = 77.72, df = 4, p < .001; χ² = 161.90, df = 4, p < .001; AIC = 127.570; BIC = 127.888. Relative to the expected frequencies, which were computed under the null hypothesis that interaction effects do not exist, one Type and two Antitypes emerged. The Antitypes indicated that the observed frequencies were lower than expected under the null hypothesis of independence. The pattern "- + -" was an Antitype, meaning that fewer juveniles than expected had no Antisocial Attitudes, were associated with Delinquent Peers, and were not Offenders. Another Antitype emerged for the configuration "+ - -", indicating that fewer juveniles than expected had Antisocial Attitudes but were neither associated with Delinquent Peers nor Offenders. The Type "+ + +" was more interesting in terms of criminological research: More juveniles than expected under the null hypothesis committed an Offense, showed Antisocial Attitudes, and spent their leisure time with Delinquent Peers. At the same time, the configuration "+ + -" was almost a Type with p = .015 (it missed the Bonferroni-adjusted alpha level), showing that there were juveniles with Antisocial Attitudes who socialized with Delinquent Peers but reported not having committed an Offense; possibly, those juveniles underreported their delinquent acts.

The R-Package confreq and Other CFA Software
Alexander von Eye (Michigan State University) has written a CFA program (von Eye, 1998), which is available as shareware. This program was written in FORTRAN 90, runs on the DOS level, and is therefore suitable only for Windows PCs. The program starts by double-clicking on the file cfa.exe. The "von Eye program" is controlled by typing numbers into the program. After it starts, the user needs to specify whether the data will be entered via a file (= 1) or interactively (= 2). The "von Eye program" can display a design matrix, if requested (without the constant); it is easy to use and allows running two-sample CFAs in addition to zero order and first order CFAs. Funke, Mair, and von Eye (2007) wrote the first R package, called CFA. However, this R package has not been updated for use in newer major R base versions. We therefore recommend using the new R package confreq. confreq is the abbreviation for configural frequencies. The package was written by Jörg-Henrik Heine (Heine et al., 2020); it is constantly updated and maintained. The name confreq avoids a mix-up with Confirmatory Factor Analysis, which is also often abbreviated as CFA. The package confreq is now available (Version 1.5.5-2) from the repositories on CRAN (see Footnote 1) and is therefore suitable for the latest R version 4.0 (R Core Team, 2020).
Within R, one can read in a frequency table by typing the patterns and their frequencies into a spreadsheet file. Data in this form are typically called tabulated data, where the rows represent all possible combinations of the variables and the rightmost column holds the respective frequencies. To prepare the data for import into R, save the spreadsheet as a csv-file in your current R working directory, naming it, for example, "5thgrade.csv" (additional materials, including the R syntax and the Excel files, are provided in Supplementary Materials). For confreq to process the tabulated data correctly, the header of the rightmost column holding the pattern frequencies must be named "Freq".
Footnote 1: See https://cran.r-project.org/web/packages/confreq/citation.html

The following R syntax will lead to the results of Table 1.

    # reading in an EXCEL file in csv-format
    order1 <- read.table("5thgrade.csv", sep = ";", header = TRUE, quote = "\"")
    order1
    # you need to load the R-package confreq
    # do not use zeros as configural patterns!
    library("confreq")
    # convert the data to patterned frequencies
    order1pat <- dat2fre(fre2dat(order1))
    order1pat
    # first order CFA
    resd1 <- CFA(order1pat, alpha = 0.05, form = "~ Offender + Delinqpeer + Attitude")
    summary(resd1)
    # inspect the design matrix of the first order CFA
    resd1$designmatrix

In the resulting design matrix for the base model (see the last syntax line above), the first column contains the constant, followed by the effect-coded main effects for Antisocial Attitude, Delinquent Peers and Offender Status.
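To make the structure of such a design matrix concrete, the following base R sketch builds an effect-coded main-effects design matrix for three dichotomous variables and fits the corresponding Poisson log-linear model with `glm()`. It does not use confreq and does not reproduce Table 1: the frequencies are invented, and the row order and sign convention ("+" coded 1, "-" coded -1) are assumptions of this example that may differ from the confreq output.

```r
# Hypothetical tabulated data; the counts are invented, not the CURL data
cells <- expand.grid(Attitude   = c("+", "-"),
                     Delinqpeer = c("+", "-"),
                     Offender   = c("+", "-"))
cells$Freq <- c(40, 10, 15, 20, 12, 18, 25, 60)

# Main-effects (first order) log-linear model with effect coding
fit <- glm(Freq ~ Attitude + Delinqpeer + Offender,
           family = poisson, data = cells,
           contrasts = list(Attitude   = "contr.sum",
                            Delinqpeer = "contr.sum",
                            Offender   = "contr.sum"))

# Design matrix: constant column of ones plus one +1/-1 column per variable
X <- model.matrix(fit)
X

# Parameter vector lambda and expected frequencies e, with log(e) = X %*% lambda
lambda   <- coef(fit)
expected <- fitted(fit)
cbind(cells, expected = round(expected, 2))
```

The matrix has one row per configuration and one column per parameter, so the product of `X` and `lambda` returns the logarithms of the expected frequencies.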

Results of the First Order CFA With One Covariate
The underlying idea is that covariates are employed in the log-linear base model to compute the expected frequencies. As the first covariate we added Parental Engagement, a scale from the Alabama Parenting Questionnaire (Frick, 1991). When CFA is expressed in terms of LLM, the covariate is added by simply extending the GLM of Equation 1 (see Glück & von Eye, 2000):
$$\log e = X\lambda + c\,\beta_c \tag{15}$$
with $c$ = covariate vector and $\beta_c$ = parameter for the covariate.
The resulting model belongs to the family of nonstandard log-linear models. The literature cautions that the parameters of such nonstandard models can be ambiguous to interpret. Mair (2007) offers a solution by looking at the effects coded in the design matrix and determining the numerical contribution of single effects.

As Equation 15 shows, the covariate is simply added as a column to the design matrix of the log-linear model; there is one score per covariate for each cell. Usually, the cell means of the continuous covariate are used; however, other statistics may also be applied, for example, medians, percentages or probabilities, and even categorical covariates and interactions of covariates with other variables are possible. If a table has t cells and the design matrix X contains k vectors (including the constant), the maximum number of covariates is t - k - 1.
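The effect of appending a covariate column can be sketched in base R by comparing a main-effects Poisson model with and without one covariate. This is an illustration under stated assumptions rather than the confreq procedure: the frequencies and the cell means in `cov1` are invented, and the object names are hypothetical.

```r
# Hypothetical tabulated data and hypothetical cell means of one covariate
cells <- expand.grid(Attitude   = c("+", "-"),
                     Delinqpeer = c("+", "-"),
                     Offender   = c("+", "-"))
cells$Freq <- c(40, 10, 15, 20, 12, 18, 25, 60)
cells$cov1 <- c(3.4, 2.1, 2.9, 2.6, 2.2, 2.8, 2.5, 1.7)

eff <- list(Attitude = "contr.sum", Delinqpeer = "contr.sum",
            Offender = "contr.sum")

# Base model (first order CFA) and base model plus one covariate column
fit0 <- glm(Freq ~ Attitude + Delinqpeer + Offender,
            family = poisson, data = cells, contrasts = eff)
fit1 <- glm(Freq ~ Attitude + Delinqpeer + Offender + cov1,
            family = poisson, data = cells, contrasts = eff)

# The covariate appears as one extra column of the design matrix
X0 <- model.matrix(fit0)
X1 <- model.matrix(fit1)

# Each covariate costs one degree of freedom; the LR statistic cannot increase
c(df_base = df.residual(fit0), df_cov = df.residual(fit1))
c(LR_base = deviance(fit0), LR_cov = deviance(fit1))

# Maximum number of covariates for this table: t - k - 1
max_cov <- nrow(X0) - ncol(X0) - 1
max_cov
```

For a Poisson model, `deviance()` equals the LR chi-square of the fitted model against the observed frequencies, so the two deviances can be read like the LR values reported in the results sections.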
The fit is still not perfect; there are significant differences between the observed and expected frequencies. However, the AIC and BIC were reduced, at the cost of one degree of freedom: LR = 48.32, df = 3, p < .001; χ² = 56.71, df = 3, p < .001; AIC = 100.173; BIC = 100.570. The Antitype "- + -" vanished because the expected frequency moved closer to the observed one. A new Type evolved, "+ - +", because the expected and observed frequencies moved further apart; the difference between the two changed from 7.57 to 16.86. The interpretation of the new Type needs to involve the configuration's covariate, meaning that all juveniles in this cell are adjusted for the covariate Parental Engagement. It may be that controlling for parents' engagement leads to Offenders with Antisocial Attitudes who do not associate as much with Delinquent Peers as expected under the null hypothesis. The remaining Antitype and Type stayed the same. It is necessary to correct for multiple testing. In confreq, either the Bonferroni adjustment or Holm's alpha protection can be applied (cf. Stemmler, 2020).
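The difference between the two alpha protection procedures can be illustrated with the base R function `p.adjust()`. The sketch below is independent of confreq and does not show how confreq applies the corrections internally; the eight local p values are invented for illustration.

```r
# Hypothetical local p values for the eight configurations of a 2 x 2 x 2 table
p_local <- c(0.0004, 0.003, 0.015, 0.04, 0.20, 0.35, 0.60, 0.90)

# Bonferroni: every p value is multiplied by the number of tests
p_bonf <- p.adjust(p_local, method = "bonferroni")

# Holm: step-down procedure, never more conservative than Bonferroni
p_holm <- p.adjust(p_local, method = "holm")

data.frame(p_local, p_bonf, p_holm,
           sig_bonf = p_bonf < 0.05,
           sig_holm = p_holm < 0.05)
```

Comparing adjusted p values with the nominal alpha of .05 is equivalent to comparing the raw p values with an adjusted alpha level, which is why a configuration with p = .015 misses the Bonferroni-adjusted level when eight cells are tested.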

Results of the First Order CFA With Two Covariates
In addition to Parental Engagement, another covariate, the use of Corporal Punishment, was added to the design matrix as a further column on the far right. The design matrix thus contains the cell means of both Parental Engagement and Corporal Punishment. The R syntax with two covariates is:

    ##### the covariates from CURL 5th Grade --------------------
    co <- read.csv2(file = "covariate.csv", header = TRUE)
    co
    # to run a CFA with two covariates, here Parental Engagement and Corporal Punishment
    erg5_CP <- CFA(order1pat, cova = cbind(co$Apq_pe, co$Apq_cp))
    summary(erg5_CP, showall = TRUE, type = "pChi")
    # have a closer look at the design matrix
    erg5_CP$designmatrix

The results of the first order CFA with two covariates can be found in Table 3. With two covariates, the significant differences between the observed and expected frequencies vanished. We invested another degree of freedom, but we now have a reasonable fit: LR = 1.60, df = 2, p = .449; χ² = 1.69, df = 2, p = .428; AIC = 55.45; BIC = 55.92. No Types or Antitypes emerged. Apparently, high or low covariate values corresponded to high or low observed cell frequencies, pulling the observed and expected frequencies together.
Although it is not a perfect association, high values of Parental Engagement were mainly present for juveniles with no Antisocial Attitudes, and high Corporal Punishment was associated with Delinquent Peers. In CFA, covariates which correlate with the residuals decrease the differences between the observed and expected cell frequencies. However, covariates which do not correlate can lead to the emergence of new Types and Antitypes (Glück & von Eye, 2000; von Eye, Mair, & Mun, 2010).

Conclusions
We demonstrated the use of CFA with covariates. CFA is a very useful tool in the realm of person-oriented research. It is related to other statistical methods that analyze patterns or configurations of information, like latent-class analysis (LCA), latent profile analysis and general growth mixture models (GGMM). GGMM are basically growth curve models performed for different latent classes (Stemmler & Lösel, 2015). All these methods have in common that they try to explain unobserved heterogeneity in groups. The appropriateness of such models is usually tested using goodness-of-fit measures such as information indices, for example, the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), and their derivatives (von Eye & Wiedermann, in press).
Compared to other person-oriented data analysis approaches, CFA can be distinguished according to different aspects. First, in comparison to LCA and the mixture models associated with it (Yamamoto & Everson, 1995), a central difference is that these models are probabilistic models in which probabilities are modeled or estimated, while in CFA observed pattern frequencies are compared with expected ones. The goal of modeling in such probabilistic models is to assign each person to one of a predefined number of latent classes on the basis of his or her response pattern with a certain probability, whereby a maximum assignment probability can usually be determined for one of the latent classes. The (ideal) model conception here consists in the assumption of disjunctive and exhaustive person classes of initially unknown size. In this sense, LCA (and other mixture models) can be viewed as a procedure for model-based data clustering, through which individual units of study (persons or objects) within the total sample can be grouped into subgroups (Fraley & Raftery, 2002). Second, with LCA and mixture models, the primary goal is optimal model fit, whereas with CFA the focus is primarily on non-fitting models and the interest is primarily in residual analysis. With LCA and mixture models, the patterns of association or structure of dependence between variables are supposed to disappear by assuming a fitting number of latent classes, thus explaining the associations between variables. CFA, on the contrary, focuses on over- or underfrequented configurations (patterns) and, to that extent, requires a non-fitting model to identify Types and/or Antitypes and thus engages in residual analysis.
The use of additional covariates makes CFA even more flexible, in particular if one investigates variables of different scale levels (e.g., categorical and interval-level variables). In person-oriented research, a covariate which is significantly related to the variables under investigation brings the observed frequencies closer to the expected frequencies; this results in a diminishing number of Types and Antitypes. Moreover, this disappearance is probably causally related. von Eye and Wiedermann (2016) wrote "specifically, whenever Types or Antitypes disappear after the design matrix was extended, the hypothesis can be entertained that the add-on effects are explanatory for the disappeared Types or Antitypes" (p. 168). In some cases, a new pattern of emerging Types and Antitypes appears, depending on the correlation of the covariates with the residuals of the model without continuous variables. Although usually only the mean or median of a single continuous variable is added to represent the cases in a cell, CFA with covariates still belongs to the person-oriented approach, because the persons or objects in a cell are still considered to be indivisible; only more information is added. Moreover, it is also still possible to add a covariate as another categorical variable functioning, for instance, as a stratification variable (cf. von Eye, 2002, Chapter 10).
Using the log-linear modeling (LLM) approach to CFA, covariates are simply added to the design matrix of a first order CFA by adding columns of means, medians or even percentages.
In addition, the use of the R-package confreq was demonstrated. When reading in the patterned frequencies from an Excel sheet, the use of covariates is straightforward. One can use as many covariates as one wishes, depending solely on the degrees of freedom left. Together with confreq, CFA is a very powerful statistical tool in person-oriented research; however, it should be mentioned that confreq does not yet allow one to perform a Bayesian CFA. Next to the first order CFA, other versions of CFA are available, for example, the two-sample CFA, prediction CFA (P-CFA) for longitudinal data, the Configural Mediator Model (Stemmler, 2020) or functional CFA (fCFA; von Eye & Mair, 2008), which makes it possible to blank out extreme outlier cells (cf. Stemmler & Heine, 2017). CFA is even a complementary tool for analyzing tree structures based on CHAID (Stemmler, Heine, & Wallner, 2019).

Funding:
The authors have no funding to report.