Quantile Regression-Based Multiple Imputation of Missing Values — An Evaluation and Application to Corporal Punishment Data

Quantile regression (QR) is a valuable tool for data analysis and multiple imputation (MI) of missing values – especially when standard parametric modelling assumptions are violated. Yet, Monte Carlo simulations that systematically evaluate QR-based MI in a variety of different practically relevant settings are still scarce. In this paper, we evaluate the method regarding the imputation of ordinal data and compare the results with other standard and robust imputation methods. We then apply QR-based MI to an empirical dataset, where we seek to identify risk factors for corporal punishment of children by their fathers. We compare the modelling results with previously published findings based on complete cases. Our Monte Carlo results highlight the advantages of QR-based MI over fully parametric imputation models: QR-based MI yields unbiased statistical inferences across large parts of the conditional distribution, when parametric modelling assumptions, such as normal and homoscedastic error terms, are violated. Regarding risk factors for corporal punishment, our MI results support previously published findings based on complete cases. Our empirical results indicate that the identified “missing at random” processes in the investigated dataset are negligible.

point out a gap between modern statistical methods and routine practice in applied psychological research. They argue that robust statistical methods are not widely used, and that analysis of variance (ANOVA) or ordinary least squares (OLS) regression are still employed frequently, even when the underlying parametric modelling assumptions are not met (e.g., in the presence of heteroscedasticity, outliers in the response, or other forms of non-normality like skewness or multi-modality). This is also the case when imputing missing values in a dataset, where creating multiple imputations (MI) of each missing value based on Rubin's (1987) theory is one of the state-of-the-art approaches (cf. Schafer & Graham, 2002). Most of the currently avail able MI procedures involve strong parametric assumptions, which are often not met by empirical data in various fields of the social sciences: Haupt et al. (2014) discuss methods to account for outliers in the response and heteroskedasticity in their analysis of risk fac tors for corporal punishment of children by their fathers; Kleinke, Reinecke, and Weins (2020) analyse highly skewed longitudinal data on juvenile delinquency with missing values and compare modelling results based on normal model-based MI to results based on imputation methods, which rely on more suitable distributional assumptions. Further examples are response times in psychological testing (e.g., Ranger & Wolgast, 2019) where empirical distributions are typically not normal and often difficult to parameterise -or research on substance abuse and (behavioural) addiction (e.g., Chodkiewicz et al., 2020;Sugiyama et al., 2020), which commonly involves skewed empirical distributions. Though, fully parametric normal model-based MI is still employed (e.g., Wayman, 2002), approaches beyond these methods are often more suitable. An argument frequently made in applied research is that MI is "robust" in general. See the discussion regarding the self-correcting property of MI (i.e., masking bias in the model of scientific interest or analysis model due to violated assumptions by overestimated standard errors) in Kleinke, Reinecke, Salfrán et al. (2020, Chapter 4.5). However, this is true only to a certain extent. Unfortunately, even imputation methods, which are believed to be robust against mild to moderate violations of parametric assumptions like predictive mean matching (PMM) for example, tend to produce biased inferences, when empirical data are highly skewed (Kleinke, 2017;Yu et al., 2007). Yu et al. (2007) recommend that the use of imputation methods with mis-specified distributional assumptions should be avoided, when the given empirical data deviate too strongly from these assumptions. Alternatives to nor mal model-based MI are quantile regression (QR)-based MI (Geraci, 2016a(Geraci, , 2016b(Geraci, , 2017 and MI of missing values based on (semi-parametric) generalized additive models for location, scale, and shape (GAMLSS; see e.g., Rigby & Stasinopoulos, 2005;Stasinopoulos & Rigby, 2007; for an implementation in R, see Salfrán & Spiess, 2018).
The two approaches provide flexibility in modelling parametric distributions (GAMLSS-based MI) -location, scale and shape of the distribution, to be more specific -or avoid parametric distributions (QR-based MI) and focus on modelling conditional quantiles of interest directly (Koenker, 2005;Koenker & Hallock, 2001). Despite their obvious advantages over normal model-based MI, systematic evaluations of QR-and GAMLSS-based MI are still scarce. Notable exceptions are Bottai and Zhen (2013) and Geraci and McLain (2018). The former paper investigated different sample sizes and error distributions and found that QR-based MI was superior to standard imputation methods like PMM. The latter paper examined a QR-based MI approach for bounded data, and found superior performance compared to PMM. Little is known, yet about QR-based MI in the case of discrete data (see the discussion in Bottai & Zhen, 2013, regarding QR-based MI of count data) or quasi-continuous data: In the social sciences, scale scores are often derived from ordinal data (e.g., the mean of Likert scale items). Resulting scale scores are often treated as quasi-continuous and analysed by methods for continuous data.
This paper evaluates QR-based MI in a setting common in the social sciences, where empirical data are frequently not genuinely continuous, normal, and homoscedastic and often contain outliers in the response. Our contribution is two-fold: First, we provide a Monte Carlo simulation in which we compare QR-and GAMLSS-based MI to normal model-based MI and PMM. Second, we replicate and extend the empirical analysis of the risk factors for corporal punishment of children by their fathers in Haupt et al. (2014) and Fritsch et al. (2019) with QR-based MI. In both papers, the results were based on the complete cases only and the response was derived from Likert scale items.
The paper is structured as follows: First, we briefly introduce the theoretical back ground of MI and QR. We then introduce the QR-and GAMLSS-based MI approaches. In Section 3, we present and discuss the results from our Monte Carlo simulation. In Section 4, we replicate the results in Haupt et al. (2014) and Fritsch et al. (2019) based on QR-based MI and contrast our results with the findings for the complete cases. A general discussion concludes.

Multiple Imputation
Multiple imputation based on Rubin's (1987) theory is a three-step process: First, each missing value is imputed multiple times by plausible values, so that researchers end up with M complete datasets. Plausible in this context refers to any value that leads to unbiased statistical inferences. 1 Second, these M datasets are analysed separately by any complete data technique such as OLS regression, or QR and yield M sets of modelling results. In the final step, the M sets of modelling results are combined into an overall result based on combination rules such as the large-sample formula given in Rubin (1987) or adaptations for small samples (Barnard & Rubin, 1999). For a detailed discussions, see 1) See Rubin (1996) for a discussion of criteria that are required to create proper imputations. Kleinke, Reinecke, Salfrán et al. (2020, Chapter 4). The combination formulae yield point estimates and standard errors for each model parameter θ p , p = 1, …, P. The combined point estimate is simply the average of the M estimates. Corresponding standard errors are obtained by combining a within-imputation and a between-imputation variance com ponent. The latter component reflects the amount of missing information with respect to θ p : Fewer information to predict the unobserved values leads to higher uncertainty in the imputations and a larger between variance. Note that if the missing values could be predicted without any uncertainty, the estimator θ p, m obtained from the m = 1, …, M imputed datasets would not vary over the datasets -resulting in a between variance of zero.
More comprehensive introductions to MI and helpful resources for applied research ers are given in: Kleinke, Reinecke, Salfrán et al. (2020), who provide a general introduc tion and review recent developments and various applications of MI. Chapters 1, 4, and 5 in Little and Rubin (2020) summarize the problem of missing values, missing value patterns and mechanisms; additionally, a taxonomy of methods for imputing missing values and a discussion of the rationale of MI in comparison to other methods to account for uncertainty in parameter estimates due to missing values are given. Applications of MI in various contexts including linear and generalized linear models, categorical data analysis, survival analysis, structural equation modelling, longitudinal data analysis, and complex survey settings are discussed in Raghunathan et al. (2018).

Assumptions
Approaches for MI typically assume that the mechanism generating the missing values in a variable y is ignorable (Rubin, 1976(Rubin, , 1987; see also Kleinke, Reinecke, Salfrán et al., 2020, Chapter 2): Observing or not observing values in y can be seen as the result of a dichotomous process generating a binary response r, which obtains the value '1' if the value of the corresponding variable is observed and '0' otherwise. Transitioning from the realized sample to the observed subsample can be interpreted as an additional sampling step. The underlying process, however, is usually unknown during data collection and assumed to be a random process. When the mechanism generating the missing values is ignorable, datasets can be analysed without the need to explicitly model the mechanism. Stated differently, an ignorable mechanism can be assumed when the missing values are generated at random, given all observed information (i.e., the probability of not observing certain values of a variable depends on observed information only). To ensure the credibility of this assumption, applied researchers typically specify rich imputation models and include all variables related to patterns of missing values into the imputation model -even if these variables are omitted from the analysis model (cf. Collins et al., 2001).

How Imputations Are Generated
MI based on Rubin's (1987) theory can be considered a Bayesian approach: Imputations are generated by drawing from the posterior predictive distribution of the variables whose values are missing given all observed values. From a mathematical point of view, the posterior predictive distribution is usually difficult to approximate, because in most datasets different types of variables are present, for example, continuous, truncated, categorical or count, and missing value patterns are usually not monotone.
A popular approach to handle these situations is the mice (multiple imputation by chained equations) framework (van Buuren, 2018;van Buuren & Groothuis-Oudshoorn, 2011;van Buuren et al., 2006), where users can specify separate conditional imputation models for each variable with missing values. The models can be specifically tailored to the distribution and type of each variable (e.g., methods norm for a standard linear regression model, logreg for a logistic regression model or polyreg for a polyto mous regression model). The software then cycles iteratively through all variables with missing values and imputes them step by step using predictions from a pre-specified regression model. Values of all other (observed and imputed) variables are used by the procedure. The approach is implemented in the R package mice (van Buuren & Groothuis-Oudshoorn, 2011). Extensions of the functionality of the package are available in other packages such as Qtools (Geraci, 2016a(Geraci, , 2016b for QR-based MI and Impu teRobust (Salfrán & Spiess, 2018) for GAMLSS-based MI. The methodology underlying both approaches is described in the next two subsections.

QR and QR-Based MI of Missing Values
The QR model assumes a linear additive functional form for the τ-quantiles of the re sponse y i conditional on the P predictors x i , where i = 1, …, n are the sample observations. Define the conditional τ quantile as where β τ denotes the parameter corresponding to the τth quantile. Parameter estimates β τ are obtained by minimizing the asymmetrically weighted loss function where ρ τ ⋅ governs the weighting. Choosing τ = 0 . 5 yields the conditional median (for further details, see Koenker, 2005).
The following data generating process illustrates the exact fit property of QR: where F τ, i is some distribution function and F τ, i −1 its inverse. Note that if q is the condi tional τ quantile, such that τ = F τ, i q , it holds that q = F τ, i −1 τ and the conditional τ quantile can be recovered by inverting the distribution function at τ. The weak assump tions about the distribution of the errors and the fact that quantiles of the conditional distribution of y i are modelled implies advantages over restrictive parametric methods when distributional assumptions of the latter are violated (e.g., in case of outliers or heteroscedasticity). A disadvantage of QR is that a sufficiently large sample size is required for the model parameters to be identified because all parameters are estimated for each quantile. 2 To illustrate QR-based MI of missing values, denote the response vector by y, the matrix of P predictors by X, and assume that all variables contain missing values (or incomplete cases). Define the dataset with missing values by U = y, X , where each of the i = 1, …, n lines of the dataset is at least partially observed. The columns of the dataset U are referred to by q = 1, …, Q, with Q = P + 1. Note that the total number of observations in the dataset is N = n⋅Q. Denote the total number of missing values as N mis and the number of observed values (or complete cases) as N obs = N − N mis . The number of missing values in each column of the dataset is referred to as n q, mis , such that N mis = n 1, mis + … + n Q, mis .
We aim to impute the missing values in U and use them in the analysis model. The matrix of imputed values is represented by V . To simplify the exposition, assume that U is sorted such that the lines with the observed values are given first and refer to the individual lines via u j , j = 1, …, J (J = min n 1, obs , …, n Q, obs ). The remaining lines with the missing values are denoted by u k , k = 1, …, K (K = N − J ). The definition of v j and v k is analogous.
In the following, we describe function mice.impute.rq from package Qtools for QR-based MI in greater detail. Note that we do not consider approaches restricting regression quantiles to avoid quantile crossing. Based on the defined notation, function mice.impute.rq proceeds according to the Algorithm 1 (Geraci, 2016a).
Algorithm 1 details the steps required for QR-based MI of missing values. First, choose the number of iterations C and the number of bootstrap samples M. For bootstrap iteration m = 1, …, M, proceed as follows: Define a temporary dataset D m, c ; in the first iteration c = 1, the lines of the dataset without missing values u j are used. At subsequent iterations 2, …, C, the imputed values from the previous iteration V m, c−1 are employed and the following five steps are used on all columns q = 1, …, Q: In (1), draw a bootstrap sample B m, c of size n q, mis from the lines of D m, c with replacement. In (2), partition B m, c into its q th column b m, c, q and all other columns B m, c, −q . In (3), draw K random values τ 1 , …, τ K from the uniform distribution U ε, 1 − ε , with 0 < ε < 0 . 5 (note that the function default is ε = 0 . 001). In (4), fit a QR model based on the partitioned bootstrap (5), use the parameter estimates β τ from (4) to impute the missing values for each column of the dataset with missing values U K . Collect the imputed values for all q = 1, …, Q columns and assign them to V m . Then, iterate the five steps until the pre-specified number of iterations C is reached. 3 The algorithm yields M datasets with imputed values V 1 , …, V M .

GAMLSS and GAMLSS-Based MI of Missing Values
De Jong (2012) and de Jong et al. (2016) have proposed a MI procedure based on a class of semi-parametric GAMLSS (Rigby & Stasinopoulos, 2005;Stasinopoulos & Rigby, 2007), which is implemented in R package ImputeRobust (Salfrán & Spiess, 2018) and ex tends the functionality of the mice package. Compared to standard normal model-based MI, which only considers modelling the location of the distribution, GAMLSS-based MI also allows to account for the scale, and the shape of the distribution and can be better tailored to a given empirical dataset.
In the following, we slightly extend the notation from the previous subsection to illustrate  Assume that the conditional distribution of the qth column u q of dataset U given the observed values u J , q and the other columns U −q is Parameters θ 1 and θ 2 are the location and scale parameters, and θ 3 , …, θ S are shape parameters -depending on the specified distribution D. Imputations for the missing values in column q of the dataset U K can be obtained by where θ q is a parameter vector and g q ⋅ a link function relating the location, scale, and shape parameter of the missing values in column q to the dataset U . Equation 5 is additively decomposed into a linear parametric part with design matrix U q and corresponding coefficient vector γ q and a sum of smooth functions ℎ j, q ⋅ of the observed values u J , q .
Since package gamlss provides different approaches to model ℎ ⋅ , such as cubic smoothing splines, penalized splines, and local regression, the following algorithm pro vides a general description of GAMLSS-based MI (Salfrán & Spiess, 2018).
Algorithm 2 details GAMLSS-based MI of missing values. First, choose the number of bootstrap samples M, set m = 1, use the lines of the dataset without missing values U J as initial bootstrap sample B 1 , and choose starting values θ 0, k, q 1 , θ 0, k, q 2 , θ 0, k, q 3 , …, θ 0, k, q S for the parameters of the distribution D. In (1), the penalized likelihood function ℓ pen is specified based on the distribution D. In (2), maximise ℓ pen based on B 1 to obtain parameter 3) Usually around 20 iterations are sufficient (for further details, see van Buuren, 2018; van Buuren & Groothuis Oudshoorn, 2011). 4) More thorough descriptions of the method and its theoretical underpinnings are provided in Salfrán and Spiess (2018) and Kleinke, Reinecke, Salfrán, et al. (2020). Chapter 6 of the latter reference discusses settings where GAMLSS-based MI is suitable. based on a linear regression model. Details about the implementation in package mice (function mice.impute.pmm) are given in van Buuren (2018, Chapter 3).
PMM is considered a semi-parametric imputation procedure: The parametric linear model is used only as a means for matching the incomplete case to a potential donor. Then, actually observed values are imputed. This preserves many properties of the observed part of the data, which makes PMM robust against mild to moderate viola tions of distributional assumptions of the normal linear regression model. This was systematically evaluated by Kleinke (2017) for different degrees of skewness, different sample sizes, and varying percentages of missing values. Their results indicated that: 1) especially when sample size was small, small donor pools yielded better results in terms of unbiased parameter estimates and standard errors compared to large donor pools; 2) PMM generally worked well, unless the data were highly skewed and/or more than roughly 20% to 30% of the data were imputed; 3) PMM generally performed better when sample size was large.
Disadvantages of PMM are that the approach can neither extrapolate beyond the range of the observed data, nor interpolate when regions of the data are sparsely popula ted. Additionally, PMM fails when there are substantial violations of the assumptions of the normal linear regression model.

Monte Carlo Evaluation of QR-Based MI Overview
In this section, we evaluate QR-based MI by Monte Carlo Simulation based on a typical setting in the social sciences: Skewed and heteroskedastic data. The underlying data were taken from Haupt et al. (2014), who investigated risk factors for corporal punishment by fathers with a QR model. We proceed as follows: First, draw 1000 bootstrap samples with replacement from the complete cases and fit a QR model to each of the bootstrap samples. Second, use the average parameter estimates as a proxy for the unknown 'true' population parameters. Third, introduce that values are missing at random (MAR) to the bootstrap samples by deleting a certain fraction of values. Finally, evaluate QR-based MI regarding bias in point estimates and coverage rates (CR) of the 95% confidence intervals of the parameters and contrast the results with the complete data estimates and results obtained from other established MI methods (standard normal model-based MI, PMM, and GAMLSS-based MI).

Hypotheses
The advantages and disadvantages of QR in comparison to GAMLSS are discussed in Kneib (2013) and Rigby et al. (2013). QR, for example, is distribution-free, while GAMLSS introduce model structure and are more parsimonious regarding the data requirements -while also bearing a higher risk of mis-specification. To our best knowledge, QR-and GAMLSS-based MI have not been compared systematically in Monte Carlo simulations, yet. We expect the preferred method to depend on the specific setting. In our simulation, we expect QR-and GAMLSS-based MI to yield comparable results. We assume that both approaches will outperform PMM and standard normal model-based MI -especially when quantiles farther away from the center of the distribution of the response are considered: PMM and normal model-based MI create imputations based on predictions for the conditional mean. Thus, both normal model-based MI and PMM should perform well in the vicinity of the center of the distribution.

Method Design
Our simulation has a 2 × 2 × 4 (percentage of missing values × missing value pattern × imputation method) design: From the bootstrap samples, we deleted 25% or 40% of values either from the response only, or from the response and one of the predictors following a MAR pattern (for details, see Section "Procedure"). We imputed missing values by QRand GAMLSS-based MI, PMM, and standard normal model-based MI.

Evaluation Criteria
We evaluated the approaches based on the goal to yield unbiased statistical inference: Bias in the parameter estimate was defined as the difference between the true parameter θ and the average estimate θ across the bootstrap samples. Bias was reported relative to the true population parameter as %Bias = Bias θ ⋅100% .
Coverage Rate (CR) was defined as the percentage of 95% confidence intervals that included the true population parameter. Schafer and Graham (2002) deem CR below 90% as seriously low. Undercoverage may result from biased point estimates and/or an underestimated variance. The confidence interval could either be too narrow to include the true parameter, and/or it can be shifted too much to the left or to the right to include the true parameter. Together, bias and CR allow to assess the accuracy of statistical inferences.

Sample and Measures
Our simulation is based on a sample taken from the longitudinal Erlangen-Nuremberg Development and Prevention Study. The data include 609 families (675 preschool chil dren) from the North-Bavarian cities of Erlangen and Nuremberg (for more details, see Lösel et al., 2009). The response in our model corporal punishment by fathers (CP), as well as two risk factors inconsistent discipline (ID) and other discipline practices (OD) were assessed using a German version of the Alabama Parenting Questionnaire (APQ; Lösel et al., 2003). Items were rated on five-point scales ranging from 1 ("never") to 5 ("always"). Scale scores were obtained by calculating the means of the respective items. The corporal punishment scale was based on three items such as "You spank your child with your hand when he/she has done something wrong". The empirical distribution of the cor poral punishment scale is summarized in Figure 1. 5 Inconsistent discipline (six items) summarizes parenting behaviors like "The punishment you give your child depends on your mood". Other discipline practices (seven items) refer to actions such as "You take away privileges or money from your child as a punishment". A further risk factor of interest was socioeconomic status (SES). The computation of the SES index was based on Geißler (1994). Finally, our model included the child's gender by the gender indicator (FEMALE).

Figure 1
Needle Plot of the Empirical Distribution of the Response Corporal Punishment 5) Note that Erlangen-Nuremberg is a relatively affluent area. More extreme CP scores might be expected in more deprived urban areas.

Procedure
We simulated 1000 bootstrap samples from the original N = 485 complete cases. Each bootstrap sample had a size corresponding to the original sample size N = 675. We then introduced MAR patterns (Rubin, 1976). Depending on the respective condition, we deleted either 25% or 40% of the values of the response CP, and/or the predictor OD. Values were deleted using the following rule: For each observation i we computed the probability of a missing value Pr i mis with the logit model Coefficients in Equation 7 were chosen to ensure probabilities of missing values ranging from about 0.2 to 0.8 -depending on the individual SES scores. In our simulation, partici pants from lower social classes were assigned larger probabilities of missing values. Depending on the respective condition, we sampled the necessary number of cases with selection probabilities Pr i mis and deleted the respective observations from the data sets. An overview of the resulting missing value patterns is displayed in Table 1. Missing values were imputed with QR-and GAMLSS-based MI, PMM, and normal model-based MI by using functions implemented in packages mice, Qtools, and ImputeRobust. Unless stated otherwise, we used default settings of the functions. All other columns of the dataset were used to predict missing values in a particular column. As common practice (e.g., Kleinke, 2018), the number of imputations was set to the percentage of missing values. We then compared parameter estimates based on the imputed datasets with the means of the 1000 complete data estimates. We obtained coefficient estimates from a QR model, in which we regressed the response CP on SES, FEMALE, ID, and OD regarding quantiles τ ∈ . 2, . 3, . 4, . 5, . 6, . 7, . 8 . Heteroscedasticity-robust standard errors were computed as in Haupt et al. (2014). Figure 2 summarizes the results on bias in the point estimates and displays the average amount of relative absolute bias across all model parameters. Following Forero and Maydeu Olivares (2009), bias was considered substantial if it exceeded 10% of the true population parameter. Overall, bias generally increased: (i) when the response and addi tionally predictor values were missing; (ii) with the number of missing values.

Results
When considering the different approaches for MI, both PMM and normal mod el-based MI had only a small bias in the center part of the distribution of the response (i.e., τ ∈ . 4, . 5, . 6 ), on average. As expected, bias increased when moving towards the tails of the distribution for both approaches. QR-and GAMLSS-based MI produced widely unbiased results across all quantiles and Monte Carlo designs. Figure 3 displays the corresponding CR. All approaches produced large enough confi dence intervals (some even slighly above the nominal 1 − α percent), such that a bias in point estimates was alleviated by a wider interval. CR was sufficiently large across all parameter estimates and quantiles, even when the percentage of missing values was 40%.

Percent Parameter Bias Produced by the Four Missing Value Methods in the Four Missing Value Scenarios
Note. Results were averaged across all parameter estimates; pmm is predictive mean matching; rq is quantile regression-based MI; gamlss is MI based on generalized additive models for location, scale, and shape; norm is Bayesian linear regression. % Bias above 10% of the true population parameter was considered substantial.

Coverage Rates (CR) of the 95% Confidence Intervals Produced by the Four Missing Value Methods in the Four Missing Value Scenarios
Note. Results were averaged across all parameter estimates; pmm is predictive mean matching; rq is quantile regression-based MI; gamlss is MI based on generalized additive models for location, scale, and shape; norm is Bayesian linear regression; horizontal line indicates nominal CR of 95%. CR below 90% were deemed substantial.

Discussion
Normal model-based MI produced unbiased inferences in the center region of the re sponse. While point estimates in the off-center regions were biased, CR were neverthe less above 90%. The observed behaviour of normal model-based MI is due to the so-called self-correcting property of MI: A mild to moderate bias in the parameter estimate resulting from the mis-specification of the imputation model by assuming normality and homo scedasticity of the error distribution is alleviated by an increase in the corresponding variance estimate. This also lead to accurate CR produced by PMM. This result should be of considerable interest to applied researchers facing datasets with incomplete cases: In practice, valid inference in off-center regions of the response are often of paramount interest -for example, when investigating risk factors for corporal punishment. Typical ly, applied researchers are not only interested in a suitable CR, but also in obtaining accurate point estimates regarding the tails of the distribution. To achieve this, both, the imputation and analysis model need to be able to capture the underlying properties of the data generating process. Our simulations show that this is widely the case for QRand GAMLSS-based MI: Valid inference can be obtained from both approaches under dif ferent missing value patterns and percentages. Following previous recommendations (cf. Kleinke, Reinecke, Salfrán et al., 2020) that the imputation method and the subsequent analysis model should coincide, we recommend to use QR-based MI when the analysis model is a QR model.

Application
In this section, we compare the findings of the two empirical applications by Haupt et al. (2014) and Fritsch et al. (2019), who used the complete cases only with results obtained by QR-based MI. Haupt et al. (2014) considered further predictors in their QR model that we did not include in our simulations above. These were 1) fathers' aggressiveness (AG), measured by the aggressiveness scale of the FPI-R personality questionnaire 2) parental distress (PD), measuring the perceived stress of being a parent, 3) dysfunctional interaction (DI), assessing the parents' perception that the child does not meet their expectations or that they are manipulated by the child, and 4) the age of the father (AGE). For further details on the predictors, see Haupt et al. (2014). CP was regressed on the aforementioned predictors and heteroscedasticity-robust standard errors were computed (Powell, 1991). Regression coefficients were estimated for quantiles τ ∈ . 2, . 3, . 4, . 5, . 6, . 7, . 8 . In addition to Haupt et al. (2014), Fritsch et al. (2019 explored the relevance of further predictors (for an overview and a detailed description, see Table 1 and Section 3 in Fritsch et al., 2019). These variables were age difference of the parents (AGE.DIFF), as well as categorical indicators if the mother or father were older than the other parent respectively, parental involvement, positive parenting, monitoring and supervision by the parents, a measure of the 'difficulty' of the child, fathers' life satisfaction, social orientation, performance orientation, excitability, inhibition, levels of stress, health concerns and physi cal complaints, candor, extraversion, and emotionality. Fritsch et al. (2019) focused on a prediction-centered approach and modelled the conditional mean of CP. We considered changes in the predictive performance when using the dataset obtained from QR-based MI.

Missing Values
Complete case analysis can be expected to yield unbiased statistical inferences in cer tain setting only, for example when only values in the response are missing or when the missing values are missing completely at random (MCAR) in the sense of Rubin (1976). MCAR means the probability of a missing value in a variable neither depends on observed information in the dataset (e.g., on SES), nor on unobserved information (e.g., on the unobserved CP values). The MCAR assumption is violated for example, when variables in the dataset have non-zero correlations with the missing value indicators of incompletely observed variables. We computed correlations of each variable with the missing value indicators of other variables in the dataset and found only small correlations between -.16 and .15. Effects of MAR should therefore be rather minor and we expect only marginal differences from the results reported in Haupt et al. (2014).
For the present analysis we included data from 575 children from families where both parents were present at the beginning of the study and for whom we had complete information regarding AGE, AGE.DIFF, and SES. Missing value patterns (e.g., van Buuren & Groothuis-Oudshoorn, 2011) are displayed in Table 2. About 82% of all cases are complete cases. No answers about CP were available from 79 fathers (about 14%). As many of these fathers also did not answer the other questionnaires (e.g., about parental distress or their aggressiveness), we could use mainly sociodemographic information (e.g., SES and FEMALE) to impute the missing values about CP. For some cases, we could also make use of observed information from the Freiburg Personality Inventory.
We imputed missing values using package mice. Missing values in the response were imputed by function mice.impute.rq from R package Qtools (Geraci, 2016a(Geraci, , 2016b. All other variables were imputed using mice's default settings. Imputation models were automatically generated using mice's default settings (for details, see van Buuren, 2018;van Buuren & Groothuis Oudshoorn, 2011). We then ran 20 iterations of mice's Gibbs sampler and created a total of m = 100 imputations. We fitted the same QR model as in Haupt et al. (2014) to the imputed data and used standard combination rules (Barnard & Rubin, 1999;Rubin, 1987) to obtain the multiple imputation inferences.

MI Results
The top part of Table 3 displays point estimates and corresponding standard errors, the bottom part deviations from the complete case results reported in Haupt et al. (2014). Coefficient estimates and standard errors are similar and the qualitative results across the two analysis are identical: Predictors SES, FEMALE, and PD have negative coefficient estimates, which are close to zero for PD for small values of τ. The strength of the abso lute effects relative to the standard errors is highest for SES and smaller for FEMALE. For the latter predictor, the relative strength of the absolute effect decreases with increasing τ. For PD, the relative strength of the effect increases with increasing τ until τ = 0 . 5 and slightly decreases for larger τ. Positive coefficient estimates are obtained for all other predictors. The largest coefficient estimates relative to the standard errors are obtained for DI, ID, and OD. For AG and AGE, coefficient estimates are generally smaller or only slightly larger than the standard errors. This confirms the risk factors for CP identified by Haupt et al. (2014) for the dataset imputed by QR-based MI: Low SES, DI, ID, and OD -while the limitations of the cross-sectional analysis and the bidirectional design of the analysis noted by Haupt et al. (2014) remain unchanged. Additionally, we reproduced the results of the prediction-centered modelling ap proach in Fritsch et al. (2019). The paper models the conditional expectation of the response CP with multiple linear regression, regression trees and random forests. 6 An 6) For details on the methods, see Fritsch et al. (2019) and the references given there.
out-of-sample evaluation of the three approaches is carried out based on root mean square error, mean square error, and mean absolute error. The paper reports superior predictive performance of random forest-based modelling. We reproduced the analysis Table 3 Parameter Estimates and Heteroscedasticity-Robust Standard Errors of Linear τ-Quantile Regression (QR) Model Based on QR-Based MI, and Comparisons With Complete Case Results of Haupt et al. (2014) Parameter τ = 0 . 2 τ = 0 . 3 τ = 0 . 4 τ = 0 . 5 τ = 0 . 6 τ = 0 . 7 τ = 0 . 8 with the 100 datasets obtained from QR-based MI as follows: First we split the datasets into training and test data; for the test data, we set aside the exact same observations as in the original analysis across all imputed datasets. Second, we fitted a multiple linear regression model, a regression tree, and a random forest to each of the 100 imputed datasets. Note that for the regression trees and the random forests, we tuned the meta-parameters of the algorithm for each of the 100 datasets. Third, we predicted the response and evaluated the predictive performance of all methods by the three prediction error criteria for each of the 100 datasets. Overall, 1) there were only minor differences across the 100 datasets in predictive performance; 2) the predictive performance of modelling based on random forests and multiple linear regression improved slightly; 3) the predictive performance of modelling based on regression trees decreased slightly. We do not report more detailed results here, as there were only minor differences compared to the results reported in Fritsch et al. (2019).

Discussion
Note that in this application, the true distribution of observed and missing values, as well as the mechanism generating the missing values is unknown. This is typical for empirical datasets, where applied researchers are generally unaware of the 'true' impact of missing values on the obtained inferences. We compared estimates based on the complete cases -under the assumption of MCAR with estimates obtained by QR-based MI under the assumption of MAR. Although we cannot rule out MNAR mechanisms, our analyses provide empirical evidence that MAR processes were negligible and did not substantially affect the modelling results. Given that the MAR assumption holds -at least approximately -and that both, the substantive model and the imputation model are correctly specified (an untestable assumption in the analysis of empirical datasets), the reported MI inferences can be considered as unbiased.

General discussion
We have evaluated QR-based MI in a scenario that is typical for the social sciences and imputed incomplete scale scores that were obtained from ordinal Likert scale items. Typ ically, bias in statistical inferences depends on multiple factors: The amount of missing values in the dataset, the mechanism generating the missing values and the resulting missing value patterns, the inter-relationships in the given dataset, and the selected imputation method and model (Raghunathan, 2016;Schafer, 1997). In our simulation, we held the percentage of missing values and the missing value pattern constant within a certain design so that differences in bias and CR can be attributed to the employed imputation methods. Overall, QR-and GAMLSS-based MI yielded the most accurate inferences in comparison to the established MI approaches normal model-based MI and PMM. While the two latter methods yielded unbiased parameter estimates only in the center parts of the distributions, QR-and GAMLSS-based MI also performed well in the off-center regions. Only in the more extreme scenarions (percentage of missing values of 40%) some small biases regarding the lower quantiles were found. Based on our findings, we conclude that if researchers are interested in modelling effects beyond the conditional mean, imputation methods beyond standard normal model-based MI are promising. For example, when considering conditional quantiles, QR-based MI is particularly useful. Overall our findings support the recommendation that the imputation model and the analysis model should coincide. We see the following directions for future research: First, as a sufficiently large sample is typically required by QR to estimate all model parameters in all quantiles of interest sufficiently well, poor performance of QR-based MI could be a byproduct of the relatively small employed sample size of N = 675. It might be instructive to replicate our simulation using a larger sample size to estimate all quantiles of interest of the conditional distribution of the response. Practical guidelines on the required sample size to obtain unbiased statistical inferences should be established by future research -ideally by taking the model complexity into account. Second, QR generally assumes a continu ous response. The corporal punishment variable employed in this paper, for example was obtained by computing the mean of discrete 5-Point Likert scale items. Obviously, this does not yield a continuous variable, but creates point masses around certain values (see also Figure 1). Treating scale scores based on discrete items as (quasi-)continuous is a common practice in the social sciences. To our best knowledge, systematic evaluations of the behaviour of QR do not exist, when the response is not (perfectly) continuous. Our results suggest, that QR-based MI works sufficiently well for missing values that are based on 5-Point Likert scale items, given an adequately sized sample. In future research, the present study could be replicated and the results could be compared to settings where the response is continuous. Additionally, it may be helpful for practitioners to establish guidelines on the number of levels required for a discrete response (e.g., derived from Likert scale items), to consider the response in a QR model (quasi-)continuous. Third, QR-based MI should be evaluated more broadly and more complex settings, such as models including non-linear terms and interaction effects, should be considered. Finally, it will be of interest to applied researchers to compare QR-based MI with other imputation methods implemented in ImputeRobust. Further systematic evaluations and comparisons will help to single out settings in which to use QR-based MI, and when to use other semi-, or even fully parametric approaches.