In this paper, I wish to highlight a problem that has become ubiquitous in scientific applications of machine learning (ML) methods. As far as I can tell, this problem has not yet been singled out for discussion in the literature; but it deserves to be named, clearly described, and widely understood by researchers, who are increasingly relying on ML techniques without always appreciating their limits and constraints.
In a nutshell, the prediction-explanation fallacy occurs when researchers use prediction-optimized models for explanatory purposes, without considering the tradeoffs between prediction and explanation. This is a problem for at least two connected reasons. First, in many typical applications of ML techniques, prediction-optimized models are deliberately biased and unrealistic in order to prevent overfitting, and hence may fail to accurately explain the phenomenon of interest. In other cases, the models have an exceedingly complex structure that is hard or impossible to interpret, which greatly limits their explanatory value. Second, different predictive models trained on the same or similar data can be biased in different ways, with the result that multiple models may predict equally well but suggest conflicting, mutually inconsistent explanations of the underlying phenomenon.
The tension between prediction and explanation is not a novel concept, and has been discussed a number of times in the literature on ML and its applications (e.g., Breiman, 2001; Shmueli, 2010; Yarkoni & Westfall, 2017). However, previous authors have focused mainly on the other side of the issue—namely, the fact that “classical” statistical models designed for accurate explanation tend to perform badly in prediction tasks (for an exception, see the recent contribution by Hofman et al., 2021). My goal is to redress this imbalance, by explicitly discussing the limitations and pitfalls of predictive models when they are used in the context of scientific explanation. As I demonstrate below, the prediction-explanation fallacy can lead to distorted and misleading conclusions—not only about the results of a single analysis, but also about the robustness and replicability of the phenomenon under study.
In what follows, I lay out the terms of the problem and introduce the tradeoffs between prediction and explanation in a non-technical fashion. I continue by presenting some real-world examples from the neuroscience literature, to illustrate different ways in which researchers may commit (or avoid) the fallacy in their work. I conclude with a brief discussion of mitigating factors and methods that can be used to limit or circumvent the problem.
Prediction ≠ Explanation
Researchers across disciplines are expanding their data analytic practices to include a variety of ML methods, from relatively simple techniques such as regularization and cross-validation1 to complex algorithms such as deep learning (for introductions see Berk, 2016; James et al., 2021; for in-depth treatments see Efron & Hastie, 2016; Hastie et al., 2009). In the fields of psychology and neuroscience, ML tutorials and easy-to-use packages are multiplying due to high demand by researchers (e.g., Koul et al., 2018; Kumar et al., 2020; Pargent et al., 2023; Rosenbusch et al., 2021; Yarkoni & Westfall, 2017).
A key reason for the success of these methods is the fact that they often outperform classical statistical procedures when the goal is to predict new outcomes, generalizing beyond the particular dataset used for training (out-of-sample prediction). Classical procedures seek to correctly represent the causal relations among variables and estimate parameters with as little bias as possible—typically via least squares or maximum likelihood estimation—in order to build accurate, well-fitting models of the data-generating process. However, they tend to yield models that overfit the data at hand, and perform badly or fail to replicate when tested on new datasets (Breiman, 2001; Rosenbusch et al., 2021; Yarkoni & Westfall, 2017). Based on these and similar considerations, Yarkoni and Westfall (2017) wrote a provocative paper in which they argued that psychology as a discipline would benefit from “choosing prediction over explanation”. In their view, psychologists should set aside their traditional preoccupation with building accurate (and theoretically elegant) causal models of psychological phenomena, and start focusing more on predictive questions (e.g., how to infer personality traits from online media usage) while embracing out-of-sample prediction as the main benchmark of success.
It goes without saying that predictive accuracy is a key advantage of ML techniques, and that predictive modeling can be remarkably useful in a variety of tasks. A stronger and more deliberate focus on prediction could benefit psychology in a number of ways, as suggested by Yarkoni and Westfall (2017). However, researchers who employ ML in their studies—or use the results of those studies as primary sources—can easily forget that superior predictive performance comes at a cost. The cost is that maximizing the predictive accuracy of a model tends to sacrifice its ability to represent the underlying phenomenon in an accurate and interpretable fashion. Indeed, when accurate prediction is the only criterion of success, the correspondence (or lack thereof) between statistical models and reality becomes irrelevant; “for all practical purposes there is no model responsible for the data” (Berk, 2016, p. 25; emphasis mine).
Traditionally, the ability to generate accurate predictions has been viewed as the hallmark of successful scientific explanations (for overviews see Douglas, 2009; Srećković et al., 2022). In a general sense, this is true; but more faithful explanations are guaranteed to yield more accurate predictions only under ideal circumstances, in which measurement and sampling error are absent or practically negligible. In the real world of noisy data, prediction and explanation become two related but distinct tasks, which typically require different modeling approaches (Shmueli, 2010; see also Bzdok & Ioannidis, 2019). The resulting tradeoffs should be taken into account whenever predictive models are used in scientific applications.
Before continuing, note that here I use the term “explanation” in a broad sense, to include not only causal relations between variables, but also patterns of association that could be regarded as more “descriptive” than strictly explanatory (but still geared toward the scientific understanding of a phenomenon; see e.g., Hofman et al., 2021; Mõttus et al., 2020; Shmueli, 2010). For example, studies that measure sex differences in a given domain or describe networks of correlations among traits (see below) are rarely guided by specific causal hypotheses, and may be largely or entirely agnostic as to the exact causal mechanisms involved. While causal explanation is typically the ultimate goal of scientific research, the careful description of association patterns is a crucial intermediate step in the process. From a statistical point of view, when causal inference and pattern description are performed in the pursuit of (eventual) scientific understanding, they both depend on the correct specification of the underlying phenomenon; from the standpoint of theory development, they both benefit from model transparency and interpretability. For the purposes of this paper, I include both of them under the umbrella of (broad-sense) explanation.
Tradeoffs Between Prediction and Explanation in Machine Learning
To simplify a complicated issue, good scientific explanations should be accurate (i.e., they should correctly represent the structure of the underlying phenomenon) as well as interpretable (i.e., they should be transparent and parsimonious). Of course, these two properties may be in tension with each other, to the extent that simplifications and approximations make explanations more cognitively and/or computationally tractable. In this Section I consider how both of them trade off with predictive accuracy in the context of ML applications (Figure 1).
Figure 1
Tradeoffs Between Prediction and Explanation in Machine Learning
Note. Tradeoffs between predictive and explanatory accuracy are most acute in underparametrized models, which have fewer parameters than the number of training data points and are subject to the classical bias-variance tradeoff. Underparametrized models tend to be easier to interpret and understand, but typically sacrifice explanatory accuracy (i.e., introduce biases) in exchange for enhanced predictive accuracy. Tradeoffs between predictive accuracy and interpretability become especially severe in overparametrized models with more parameters than training data points. Highly overparametrized models may achieve good predictive accuracy and low bias, but their complexity makes them opaque and massively unparsimonious, which greatly limits their explanatory value. While not the main focus of this paper, explanatory accuracy and interpretability can also be in tension with each other (dashed arrow).
First and most relevant to the topic of this paper, tradeoffs between predictive and explanatory accuracy occur when better predictive performance is achieved at the cost of increased model bias. Models optimized to provide accurate explanations of a phenomenon (e.g., unbiased parameter estimates) tend to overfit the training data, and hence generalize poorly to new samples; conversely, strategies designed to avoid overfitting (e.g., regularization) inevitably introduce various kinds of biases, which can improve a model’s performance if appropriately chosen (Schaffer, 1993). Thus, “the ‘wrong’ model can sometimes predict better than the correct one” (Shmueli, 2010, p. 293), and:
“From a statistical standpoint, it is simply not true that the model that most closely approximates the data-generating process will in general be the most successful at predicting real-world outcomes […] a biased, psychologically implausible model can often systematically outperform a mechanistically more accurate, but also more complex, model” (Yarkoni & Westfall, 2017, p. 1100).
This kind of tradeoff is most acute in underparametrized models, i.e., models that have fewer free parameters than the number of data points in the training set. To understand why, it is useful to refer to a key concept in predictive modeling, the bias-variance tradeoff. In brief, the total prediction error of a model is the sum of two distinct sources of reducible error: the square of the average difference between true and predicted values, or squared bias, and the variance of that difference across multiple samples (plus an irreducible noise component that no model can eliminate). Biased models yield predictions that are systematically incorrect, whereas the predictions of high-variance models change dramatically depending on the specific dataset used for training. As a rule, simpler models of a phenomenon tend to be more biased, but also more stable with respect to sampling noise. As models become more flexible and complex (relative to the size of the dataset), they increasingly tend to overfit the training data; accordingly, their predictions become less biased but also more variable across datasets. In the classical formulation of the tradeoff, model complexity/flexibility is indexed by the number of parameters estimated from the training data. Depending on the shape of the bias and variance functions, maximizing prediction accuracy may require a compromise between the two sources of error, and hence the introduction of biases that reduce the explanatory accuracy of the model (Figure 2A; see James et al., 2021; Shmueli, 2010; Yarkoni & Westfall, 2017).
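For readers who prefer the decomposition in symbols, a standard statement of the expected squared prediction error at a point $x_0$ (see, e.g., James et al., 2021) is:

$$
\mathbb{E}\!\left[\big(y_0 - \hat{f}(x_0)\big)^2\right]
= \underbrace{\big(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\big)^2}_{\text{squared bias}}
+ \underbrace{\operatorname{Var}\!\big(\hat{f}(x_0)\big)}_{\text{variance}}
+ \underbrace{\operatorname{Var}(\varepsilon)}_{\text{irreducible error}},
$$

where $f$ is the true regression function, $\hat{f}$ is the model fitted to a training sample, and the expectation and variance are taken over training samples (and the noise in $y_0$). The irreducible term is constant, so the compromise discussed above concerns only the first two terms.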
Figure 2
Panel A: Illustration of the Classical Bias-Variance Tradeoff in Underparametrized Models. Panel B: Illustration of “Double Descent” in Highly Overparametrized Models
Note. For Panel A, the total predictive error is the sum of the (squared) bias plus the variance. The highest predictive accuracy (i.e., minimum prediction error) is achieved by comparatively simple models that introduce bias in exchange for reduced variance. For the scenario in Panel B, variance starts decreasing again once the interpolation threshold is passed, leading to an overall reduction in predictive error without a corresponding increase in model bias. In this illustration, the highest predictive accuracy is found in the overparametrized region, but this is not always the case in practice. In both panels, N is the size of the training dataset.
Tradeoffs between predictive accuracy and interpretability occur when models grow so complex that they are difficult (if not impossible) to parse and understand (see Boge, 2022; Breiman, 2001; James et al., 2021). While underparametrized models can become fairly large and intricate, tradeoffs of this kind are the rule when dealing with overparametrized models, which have more parameters than the number of training data points. As has become clear over the last few years, highly overparametrized models—most notably neural networks, random forests, and other ensemble algorithms—often manage to “escape” the classical bias-variance tradeoff, and achieve low levels of bias while also avoiding overfitting.2 This happens when, as models grow more complex, variance first increases—as in the classical scenario—but then begins to decrease once the number of parameters exceeds that of data points (Figure 2B; see Belkin et al., 2019; Dar et al., 2021; Hastie et al., 2019; Yang et al., 2020). In many cases, the overall result is the pattern illustrated in Figure 2B and known as double descent: the total prediction error first decreases, then increases up to the interpolation threshold (i.e., the point at which the model has as many parameters as the training data points), and then decreases again once the threshold is passed. Depending on the specifics of the method and the data, the models with the best predictive performance may be found in the overparametrized region rather than at the classical “sweet spot” of the underparametrized regime (Belkin et al., 2019; Dar et al., 2021; James et al., 2021).
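To get a feel for this phenomenon, the following is a minimal sketch of the kind of experiment used to produce double-descent curves, in the spirit of Belkin et al. (2019): a fixed training set, random nonlinear features whose number p is swept across the interpolation threshold, and minimum-norm least squares fitting. The data-generating function, noise level, and feature counts are arbitrary choices made for illustration; the exact shape of the resulting error curve depends on them, but the test error typically peaks near p ≈ N and declines again in the overparametrized region.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d_input = 50, 2000, 5

# Hypothetical smooth data-generating function (an arbitrary choice for illustration)
def f(X):
    return np.sin(X).sum(axis=1)

X_train = rng.normal(size=(n_train, d_input))
X_test = rng.normal(size=(n_test, d_input))
y_train = f(X_train) + rng.normal(scale=0.3, size=n_train)
y_test = f(X_test)

# Random ReLU features; p sweeps across the interpolation threshold (p = n_train)
for p in [5, 10, 25, 45, 50, 55, 100, 500, 2000]:
    W = rng.normal(size=(d_input, p))
    features_train = np.maximum(X_train @ W, 0)
    features_test = np.maximum(X_test @ W, 0)
    # np.linalg.lstsq returns the minimum-norm solution when the system is underdetermined
    beta, *_ = np.linalg.lstsq(features_train, y_train, rcond=None)
    test_mse = np.mean((features_test @ beta - y_test) ** 2)
    print(f"p = {p:4d}   test MSE = {test_mse:.3f}")
```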
To the extent that overparametrized models achieve low bias, they can be said to possess explanatory accuracy; however, their structure tends to be opaque and inscrutable, to the point that one needs specialized techniques (developed under the rubric of ML “explainability” and/or “interpretability”) to extract usable information and try to explain how they work (e.g., Biecek & Burzykowski, 2021; Linardatos et al., 2021; Molnar, 2019). Stated differently, these models manage to internally represent the structure of the underlying phenomenon, but typically do so in a convoluted and massively unparsimonious form. This greatly limits their explanatory value as models of the phenomenon, even when they can be used to extract usable information by indirect means. Indeed, explainability/interpretability methods typically attempt to explain the behavior of an overparametrized “black-box” model (e.g., figure out how the model arrives at its predictions) rather than explain the phenomenon per se (Linardatos et al., 2021). Some of these methods seek to replace the original model with a simpler and more transparent one (see the later section on surrogate models); by doing so, they fall back into the classical bias-variance tradeoff. In this regard, one should note that, when a domain is characterized by well-defined and meaningful variables, simple and interpretable models (e.g., logistic regression) often perform similarly to neural networks and other overparametrized black boxes. Tradeoffs between predictive accuracy and interpretability are by no means inevitable, and tend to arise more often with certain types of models and problems than others (see Rudin, 2019).
To summarize: the tension between predictive and explanatory accuracy is especially severe in underparametrized models (under the classical bias-variance tradeoff), whereas the key contrast in overparametrized models is that between predictive accuracy and interpretability (Figure 1). Note that the number of free parameters—important as it is—provides only a partial picture of a model’s complexity and flexibility, which depend more generally on the structure and functional form of the relations among variables. These aspects of a model contribute to the tradeoffs between prediction and explanation even when they are not fully captured by the number of parameters. From a broader perspective, a variable may contribute to accurate prediction for reasons that have nothing to do with its theoretical importance or causal role in the explanation of a phenomenon. For example, some variables may contribute to prediction more than others merely because they are measured with less error. Or, it is entirely possible for the same variable to improve the performance of a predictive model, but seriously distort the results of an explanatory model (for example because it acts as a collider for the effect of interest; see Elwert & Winship, 2014; Rohrer, 2018). Correctly representing the causal relations between variables is essential for scientific explanation, but unnecessary and usually irrelevant for prediction (see Srećković et al., 2022). For these reasons, predictive and explanatory accuracy may come at each other’s expense regardless of the complexity and number of parameters of the models employed.3
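The collider scenario is easy to make concrete with a toy simulation (the variables here are hypothetical and the code is only an illustrative sketch). Two causally unrelated variables can appear strongly related in a model simply because a common effect of both is included as a predictor; including the collider genuinely improves prediction while distorting any explanatory reading of the coefficients.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 10_000

# X and Y are causally and statistically unrelated; Z is a common effect (collider)
x = rng.normal(size=n)
y = rng.normal(size=n)
z = x + y + rng.normal(scale=0.5, size=n)

# Regressing Y on X alone: coefficient and R^2 both near zero
m1 = LinearRegression().fit(x.reshape(-1, 1), y)

# Adding the collider Z boosts R^2 considerably, but the coefficient on X
# becomes strongly negative: a spurious "effect" induced by conditioning on Z
m2 = LinearRegression().fit(np.column_stack([x, z]), y)

print("Without Z: coef(X) =", round(m1.coef_[0], 3),
      " R^2 =", round(m1.score(x.reshape(-1, 1), y), 3))
print("With Z:    coef(X) =", round(m2.coef_[0], 3),
      " R^2 =", round(m2.score(np.column_stack([x, z]), y), 3))
```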
Researchers commit the prediction-explanation fallacy when they overlook the tradeoffs between prediction and explanation, and uncritically use prediction-optimized models as explanations of the underlying phenomena. Note the word “uncritically”: using predictive models for explanatory purposes is not necessarily a problem; there are circumstances in which this approach is justified, and methods that allow researchers to circumvent the problem or at least reduce its severity (more on this below). For the same reasons, committing the fallacy does not automatically invalidate one’s analysis, nor does it mean that one’s interpretation of the results is necessarily wrong; it is possible to follow an invalid or unwarranted chain of inference, and still reach a correct conclusion.4 The point is that the tension between prediction and explanation should be explicitly taken into account by researchers, and addressed on a case-by-case basis.
Regularization as a Source of Bias
The fact that biased, oversimplified models such as unit-weighted regression can outperform their classical counterparts when used for prediction has been known for a long time (see e.g., Dawes, 1979; Hagerty & Srinivasan, 1991). Newer regularization techniques (for example the LASSO and elastic net; see Hastie et al., 2009; James et al., 2021) offer powerful ways to shrink (i.e., strategically bias) the model coefficients while simultaneously selecting an optimal subset of variables. They do so by making assumptions of sparsity—which, simply stated, means that most of the effects of interest (e.g., those measured by the coefficients of a regression model) are assumed to be zero in the population. The underlying rationale is that sparse models (i.e., models with relatively few nonzero coefficients) tend to perform well in prediction even if the data-generating process is not actually sparse (this is known as the “bet on sparsity”; see Hastie et al., 2009, pp. 610–611). Of course, if the data-generating process does happen to be sparse, these techniques can effectively filter out sampling noise and yield models with high predictive and explanatory accuracy;5 but this is not true in general, and researchers routinely apply regularization to domains in which sparsity is not a plausible assumption. In sum, regularization (especially when it involves sparsity) can be a major source of bias in predictive models, and this should be taken into account when evaluating their adequacy as explanations.
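As a minimal illustration of the “bet on sparsity” at work, the sketch below (using scikit-learn on hypothetical simulated data in which every predictor has a small nonzero effect, so the true process is dense) contrasts ordinary least squares with a cross-validated lasso. The lasso will typically set a number of coefficients exactly to zero even though none of the true coefficients are zero; how many depends on the sample size and the noise level.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

rng = np.random.default_rng(0)
n, p = 100, 20

# Dense data-generating process: all 20 predictors have small, equal true effects
X = rng.normal(size=(n, p))
true_beta = np.full(p, 0.2)
y = X @ true_beta + rng.normal(scale=1.0, size=n)

ols = LinearRegression().fit(X, y)   # explanation-oriented, (approximately) unbiased fit
lasso = LassoCV(cv=5).fit(X, y)      # prediction-oriented, sparsity-inducing fit

print("OLS coefficients:  ", np.round(ols.coef_, 2))
print("Lasso coefficients:", np.round(lasso.coef_, 2))
print("Coefficients shrunk exactly to zero by the lasso:",
      int(np.sum(lasso.coef_ == 0)), "of", p)
```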
The same logic applies to models whose purpose is not to predict a specific outcome but to describe relations among sets of variables. For example, network models have become quite popular in psychopathology and personality psychology, where they are used to investigate the relations between multiple symptoms, traits, and/or behaviors (see Costantini et al., 2015; Epskamp et al., 2018; McNally, 2016). Researchers in these fields typically use specialized versions of the LASSO to reduce the number of nonzero connections (edges) in the estimated networks. This is done for two distinct purposes: the first is to eliminate “spurious” edges and simplify the interpretation of the results (i.e., increase explanatory accuracy and interpretability); the second is to improve the generalizability of the estimated networks across different samples (analogous to out-of-sample prediction; see Epskamp et al., 2017; Epskamp & Fried, 2018).
Unfortunately, these goals can be in conflict with one another. Since the LASSO is based on the assumption of sparsity, it tends to return sparse networks regardless of the underlying structure of the variables, especially when sample size is small for the number of parameters in the model; thus, finding a sparse network after regularization is not convincing evidence that the true structure is actually sparse (Epskamp et al., 2017; Epskamp & Fried, 2018). Regularized networks can successfully recover the underlying structure of the variables even in relatively small samples (and thus achieve high explanatory accuracy in addition to generalizability) if that structure happens to be sparse; otherwise, they may introduce significant biases and suggest distorted explanations of the phenomenon under study. Failing to understand this problem can lead researchers to commit a variant of the prediction-explanation fallacy, as they use regularized networks to describe the structure of the variables without considering the biases they introduce.
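To make this concrete: psychological network studies typically rely on EBIC-based graphical lasso routines in R packages such as qgraph and bootnet; the sketch below uses scikit-learn’s GraphicalLassoCV as a rough stand-in, applied to data simulated from a deliberately dense (non-sparse) correlation structure. How many edges survive depends on the sample size and the penalty chosen by cross-validation, which is precisely why a sparse estimate cannot, by itself, be read as evidence of a sparse reality.

```python
import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(1)
n, p = 80, 10  # hypothetical small sample, 10 observed variables

# Dense data-generating process: every pair of variables is (weakly) related,
# so the true network has all 45 possible edges
true_cov = np.full((p, p), 0.3) + 0.7 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), true_cov, size=n)

# Regularized network estimation (graphical lasso with cross-validated penalty)
model = GraphicalLassoCV().fit(X)
precision = model.precision_  # partial associations live in the precision matrix

# Count the edges that survive regularization
upper = np.triu_indices(p, k=1)
n_edges = int(np.sum(np.abs(precision[upper]) > 1e-6))
print(f"Edges in the estimated network: {n_edges} of {p * (p - 1) // 2}")
```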
Many Models, Many Explanations: The Rashomon Effect
The tension between prediction and explanation has an important corollary, known as the Rashomon effect (Breiman, 2001). For a given prediction problem, there is usually a multitude of models that predict about equally well; but each model may tell a somewhat different story about which predictors are important and/or how they are related (potentially including different functional forms for their relationships), especially if the dataset includes a large number of mutually correlated variables. In other words, the models are equally good for prediction, but suggest different, mutually inconsistent explanations of the same phenomenon (see also Dong & Rudin, 2020; Fisher et al., 2019; Hancox-Li, 2020). This should not come as a surprise; the point is that different models may achieve the same predictive performance by implementing different biases (i.e., finding different but equally useful ways to be “wrong”). For a striking demonstration of the multiplicity of high-performing models, one can see the extensive study by Fernández-Delgado et al. (2014), who tested the performance of 179 classifiers from 17 families of ML models on a collection of 121 real-world datasets. In the vast majority of the datasets, there were dozens of classifiers that performed equally well or within a narrow margin.6 More recently, D’Amour and colleagues (2020) investigated the Rashomon effect and its implications in a variety of ML applications, from computer vision to natural language processing (see also D’Amour, 2021).
The Rashomon effect comes in two flavors. On the one hand, models based on different algorithms and functional forms (e.g., logistic regression, classification trees, neural networks) can perform very similarly to one another when trained and tested on the same data. On the other hand, even using a single type of model can yield unstable results, because small changes in the data or in the tuning parameters can have a dramatic impact on which variables get selected and/or on the model coefficients (Breiman, 2001). For example, regularization techniques deal with multiple redundant variables by excluding some of them from the model, or shrinking their coefficients by a large amount; but precisely which variables end up being excluded or deemphasized in a particular model may depend on minor fluctuations of the data.
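A quick way to see this second flavor of the effect is to refit the same prediction-optimized model on bootstrap resamples of a single dataset and record which variables survive selection each time. The sketch below does this with a cross-validated lasso on hypothetical simulated data containing clusters of correlated, partially redundant predictors; with such data, the selected subsets often differ noticeably from resample to resample.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)

# Hypothetical dataset with strongly correlated, partially redundant predictors
X, y = make_regression(n_samples=150, n_features=20, n_informative=10,
                       effective_rank=5, noise=5.0, random_state=0)

# Refit a cross-validated lasso on bootstrap resamples and note which variables
# end up with nonzero coefficients each time
for b in range(5):
    idx = rng.choice(len(y), size=len(y), replace=True)
    lasso = LassoCV(cv=5).fit(X[idx], y[idx])
    print(f"resample {b}: selected variables = {np.flatnonzero(lasso.coef_)}")
```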
A notable consequence of the Rashomon effect is that when different types of models are trained on the same dataset, they may easily identify different sets of variables as being “important” for prediction. Even training the same type of model on similar datasets may yield contradictory accounts of the importance of variables—not because the phenomenon under study lacks consistency, but because prediction-optimized models tend to be unstable (in the sense explained earlier).7 When predictive models trained on the same or similar data appear to suggest markedly inconsistent explanations of a phenomenon, it can be tempting to conclude that the phenomenon itself is not robust; however, this is just an insidious manifestation of the prediction-explanation fallacy in the context of multiple models. The underlying phenomenon may or may not be robust; but there is no way to know based solely on the observed inconsistency between alternative models.
Explanation and Prediction in Surrogate Models
In the attempt to understand the workings of opaque black box models, researchers sometimes use global “surrogate models” (see e.g., Molnar, 2019). A global surrogate is just a simpler, interpretable model that is trained to predict the predictions of the original model. For example, imagine that a complex neural network was trained on a dataset to predict a binary outcome. Researchers could then train a logistic regression model on the same dataset, but instead of predicting the original outcome, the surrogate would be trained to predict the predictions made by the neural network. This simpler model could then be probed for insights into the workings of the original model. A global surrogate aims to reproduce the output of an entire model, in contrast with “local surrogates” that focus on individual predictions and try to explain how the model arrived at them (Biecek & Burzykowski, 2021; Linardatos et al., 2021; Molnar, 2019). Of course, the success of this strategy depends on the ability of the surrogate to approximate the key functional relations between variables in the original model.
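As a concrete sketch of the procedure (with a random forest standing in for the black box and simulated, hypothetical data), a global surrogate can be built in a few lines with scikit-learn:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical data and "black-box" model
X, y = make_classification(n_samples=1000, n_features=15, n_informative=8,
                           random_state=0)
black_box = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Global surrogate: an interpretable model trained to predict the *predictions*
# of the black box, not the original outcome
y_hat = black_box.predict(X)
surrogate = LogisticRegression(max_iter=5000).fit(X, y_hat)

# Fidelity: how well the surrogate reproduces the black box's behavior
fidelity = accuracy_score(y_hat, surrogate.predict(X))
print(f"Surrogate fidelity to the black box: {fidelity:.2f}")
print("Surrogate coefficients (one rough window into the black box):",
      np.round(surrogate.coef_.ravel(), 2))
```

Note that the fidelity score only indicates how well the surrogate mimics the black box on these data; it says nothing about how faithfully either model represents the underlying phenomenon.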
What is easily missed is that surrogate models are subject to all the tradeoffs discussed in this section; thus, improving the predictive accuracy of a surrogate model—for example by regularization and/or cross-validation—will tend to make it less accurate as an explanation of the original model. This is a problem because the purpose of surrogate models is intrinsically explanatory. Moreover, training multiple surrogate models on the same or similar data can give rise to Rashomon effects: different surrogates may seem to explain the original model(s) about equally well, but suggest multiple, inconsistent explanations of how they work. Overlooking these issues can lead to “second-order” instances of the prediction-explanation fallacy, which may be particularly hard to detect and correct.
Illustrative Examples
Sex Differences in Brain Structure
In recent years, there has been a surge of studies employing predictive ML methods to distinguish between males and females based on their brain anatomy (e.g., data on cortical volume, thickness, or three-dimensional morphology). Classification accuracy is typically above 90%, but drops to 60–70% when differences in total brain volume are controlled for (reviewed in Eliot et al., 2021; see also Lao et al., 2004; Tunç et al., 2016; van Eijk et al., 2021). In view of the high predictive accuracy achieved by these models, it can be tempting to use them to determine which regions of the brain contribute the most to differentiating males and females—a fertile ground for the prediction-explanation fallacy in all its forms.
For a clear-cut example of the fallacy, consider the study by Luo et al. (2019). These authors aimed to answer two questions: “(a) can gender be discriminated with a high accuracy using cortical 3-D morphology? (b) What is the most discriminative region of gender in cortical 3-D morphology?” (p. 2). To this end, they trained a hierarchical sparse representation classifier on cortical morphology data, achieving 97% accuracy. Then, they used bootstrapped model weights to identify “important 3-D morphological features in gender discrimination” (p. 7). A brain map of the discriminative regions (see Luo et al., 2019, Figure 4) showed a highly sparse configuration; this is not surprising, given the strong sparsity assumptions built into the model (pp. 4–5). This study exemplifies the fallacy because the authors went straight from training a predictive model to making statements about the most important differences between male and female brains, e.g., “The main morphology difference for gender exists mainly in the frontal lobe and the limbic lobe, others scattered in the parietal lobe, the temporal lobe, the corpus callosum and the precuneus” (p. 7). They did not discuss how their modeling decisions might have biased the analysis, or caution readers against incorrect interpretations of their findings. It is important to stress that the prediction-explanation fallacy does not lie in the methodology of a study per se, but in the use and interpretation of the results. It can also be useful to restate that committing the fallacy does not automatically invalidate the results of a study; however, it does raise questions about their interpretation, and may challenge the validity of the explanatory inferences drawn by the authors.
A neuroimaging study by Anderson et al. (2019) illustrates a diametrically opposite approach to the tradeoffs between prediction and explanation. These authors applied independent component analysis (ICA) to cortical volume and density data, and used the resulting components to train and compare a number of predictive models. Logistic regression and support vector machines (SVM) performed best, with a classification accuracy of 93%. However, the authors did not plot the model weights or use them to identify brain regions that discriminate between males and females; instead, they presented descriptive maps of sex-differentiated regions based on the results of ICA (see Anderson et al., 2019, Figures 1 and 2). This study avoided the prediction-explanation fallacy by restricting the use of predictive models to the classification task. Note that this is not necessarily the optimal strategy; depending on context, careful consideration of the weights of predictive models can provide useful information and complement the results of other analyses. Another possibility is to deliberately fit different types of models to the data, some optimized for prediction and others for accurate description/explanation. For example, Sepehrband et al. (2018) analyzed sex differences in cortical structure with two models—a prediction-optimized SVM and a standard general linear model (GLM)—and explicitly compared their results, while taking care to note the different goals of the two analyses. Although model weights were generally concordant, several regions showed high discriminatory power in the SVM but no significant sex differences in the GLM (see Sepehrband et al., 2018, Table 2). Those regions could be promising candidates for follow-up analyses, because the SVM algorithm might have picked up complex interaction patterns that would have been missed by the simple GLM used in the study.
As I noted earlier, the prediction-explanation fallacy applies not only to the results of individual models, but also to comparisons between multiple models and studies. As part of their critical review of the research on sex differences in the brain, Eliot et al. (2021) compiled a dozen studies that had used ML methods to predict a person’s sex from patterns of brain structure and function (see Eliot et al., 2021, Table 7). They noted:
“[T]he studies […] differ strikingly in features found to be most important for [sex/gender] classification accuracy. Of course, one would not expect similar features to emerge between studies using qualitatively different data, such as rsfMRI activity versus regional gray matter volumes. But even among studies that relied exclusively on structural measures, we see a lack of replication among the brain regions identified as most important for male/female classification across studies” (p. 681).
And concluded:
“[T]his discrimination is largely based on brain size and there is no agreement about local features that are most important for distinguishing male versus female types. The lack of hallmark ‘male’ versus ‘female’ brain features is likely because each algorithm was custom-developed for its particular dataset. […] These findings challenge the notion that there exists a discrete set of variables that capture core differences between male and female brains across the human species.” (p. 681).
However, variability in the “important features” identified across models and samples—even with similar data and similar levels of predictive accuracy—may easily arise as a manifestation of the Rashomon effect, and cannot be used to draw simple conclusions about the existence (or lack thereof) of reliable differences between male and female brains. While local overfitting may have contributed to inflating the variability of these findings (as hinted at in the passage above), one should not expect high levels of consistency to begin with; in fact, more aggressive strategies to reduce overfitting may even exacerbate the Rashomon effect instead of reducing it (see Breiman, 2001; Schaffer, 1993). Note that I am not arguing that Eliot et al.’s conclusions are necessarily wrong; my point is that they are not warranted by the observation that different models rely on different sets of brain features for prediction.
My last example for this section is an interesting, widely circulated preprint by Sanchis-Segura et al. (2021).8 These authors explored the issue of sex differences in brain structure with a variety of descriptive and predictive methods. In one of the analyses, they used a dataset of gray matter volume to train five different classifiers: two “classical” models fit without regularization or cross-validation (logistic regression and linear discriminant analysis [LDA]), and three prediction-optimized models (SVM, random forests, and multiple adaptive regression splines [MARS]). The classifiers achieved similar levels of accuracy (86–90% without correcting for total brain volume, 59–66% in a volume-corrected dataset), and were used to generate five estimates of the “probability of being classified as male” (PCAM) for each participant. Then, the authors trained another set of predictive models (boosted beta regression) that used regional volumes to predict the PCAM scores generated by the classifiers, and compared regression weights across the resulting models. In other words, the authors used beta regression models as global surrogates, to identify the brain features that contributed most to prediction in the original classifiers and quantify their relative importance. Despite the high correlations between the five PCAM scores (the average correlation was .87 without correcting for total brain volume, .70 in a corrected dataset; see Sanchis-Segura et al., 2021, Figure 7D), the relative importance of different brain regions varied substantially across classifiers, yielding low levels of consistency by most measures (see Sanchis-Segura et al., 2021, pp. 6–7, Figures 5 and 6).9
The authors correctly noted that “because they differ in their statistical assumptions and operations, distinct algorithms rely on distinct brain features […] and assign different PCAM scores to the same subjects” (p. 7). But then they went on to write:
“Therefore, it is apparent that—despite working with identical data from the same individuals—the different algorithms tested in the present study do not provide directly exchangeable outcomes or identify a single, coherent, and reproducible subset of brain features as the source of the males-females multivariate differences […] Together, these sources of empirical evidence directly challenge the binary sex views of human brains […] these views assume that, because distinct ML algorithms are able to correctly ‘predict’ sex from neuroanatomical features in 80–90% of the cases, all these algorithms must be identifying two distinct brain types in the human species, one typical of males and the other typical of females […]. However, these universal ‘brain types’ do not seem to really exist, given that different algorithms identify distinct brain features as the landmarks of ‘male/ female brains’ in different samples of females and males and when applied to the same subjects” (pp. 7–8).
In this interpretation of the results, the authors committed two instances of the prediction-explanation fallacy. First and more obviously, they read the lack of concordance among models as evidence against the existence of universal male/female “brain types”; because different classifiers can be expected to rely on different sets of predictors, even when trained on the same data, this inference is unwarranted.10 The second fallacy concerns the use of boosted beta regression to infer the relative importance of brain features according to different classifiers. This algorithm uses gradient boosting and cross-validation to select an optimal subset of variables for prediction (Schmid et al., 2013). The resulting surrogate models can be expected to maximize predictive performance at the expense of explanatory accuracy, and to show a degree of instability when faced with many redundant predictors. In other words, different regression models trained on similar PCAM scores may select somewhat different sets of “important predictors” for reasons that have nothing to do with a lack of consistency between the original classifiers. It remains unclear to what extent the discordance on display in Sanchis-Segura et al. (2021), Figures 5 and 6 is due to actual differences among the classifiers, or to instability in the regression models used to explain their functioning. These surrogates identified similar sets of important predictors for logistic regression and LDA, which is encouraging given the strong similarity between these algorithms (see James et al., 2021). At the same time, the PCAM scores produced by the two classifiers correlated at .99, making this a limit case of almost perfect consistency.
Neural Correlates of Emotions
Most biological theories of emotions (e.g., Ekman, 1999; Panksepp, 1998) postulate the existence of brain mechanisms specialized to produce specific emotional responses. According to these theories, the experience of different emotions—such as happiness, anger, or disgust—should correlate with somewhat distinctive patterns of brain activity (“neural signatures”). The existence of such signatures should make it possible to accurately predict the emotional state of a person from measures of his/her brain activity. By applying ML methods to functional neuroimaging data, researchers have been able to classify participants’ emotional states into discrete categories with significant accuracy (e.g., Kassam et al., 2013; Kragel & LaBar, 2015; Saarimäki et al., 2016).11
Some of the studies in this area can serve to illustrate the prediction-explanation fallacy in its subtler forms. For example, Kragel and LaBar (2015) trained a set of partial least squares discriminant analysis models, and employed cross-validation to select the number of latent variables. They then used model coefficients to identify the brain voxels that contributed most strongly to predicting each specific emotion (see Kragel & LaBar, 2015, Figure 3). The authors noted that the voxels selected by the predictive models overlapped only in part with those that showed significant differences in univariate GLMs; however, they did not discuss the respective biases and limitations of the two types of models, and presented their results in a way that blurred the line between prediction and explanation. For instance:
“Our analysis of regression coefficients revealed that this information was contained within diverse patterns of activation, spanning a number of cortical and subcortical brain regions. […] Maps for contentment included precuneus, medial prefrontal, cingulate and primary somatosensory cortices […] Despite engaging partially overlapping neural substrates at the macro-scale, emotion-predictive patterns were largely non-overlapping at the voxel level. Such separability of emotional states at the voxel level may explain why meta-analytic works […] have associated neural activity with discrete emotions (e.g., correspondence between activation within the amygdala and fear or dorsal anterior cingulate and happiness), yet have failed to consistently identify emotion-specific neural substrates” (pp. 1444–1445).
In a similar study, Saarimäki et al. (2016) trained neural networks to classify emotional experiences into discrete categories, and used model weights to calculate and plot the “importance values” associated with each brain voxel (see Saarimäki et al., 2016, Figure 3). Even if they did not discuss the tradeoffs between prediction and explanation, these authors were generally careful to explicitly frame their results in terms of prediction, and managed to avoid confusions between the two domains throughout most of the paper. However, both in the title (“discrete neural signatures of basic emotions”) and at various points of the discussion they seemed to equivocate between the sets of predictive voxels used by the neural networks and the broader (explanatory) concept of neural signatures. For example:
“Our results reveal that basic emotions are supported by discrete neural signatures within several brain areas, as evidenced by the high classification accuracy of emotions from hemodynamic brain signals. […] The distributed emotion-specific activation patterns may provide maps of internal states that correspond to specific subjectively experienced, discrete emotions […] In our study, the medial prefrontal and medial posterior regions […] contributed most significantly to classification between different basic emotions […]. Thus, local activation patterns within these areas differ across emotions and thus presumably reflect distinct neuronal signatures for different emotions” (pp. 7–8).
Lacking an explicit discussion of the explanatory limits of the analysis, the readers of this paper may easily look at the results and draw unwarranted conclusions about the “signatures” of different emotions and their localization.
In contrast with classical emotion theorists, constructivist scholars deny the existence of specialized emotion mechanisms in the brain. An influential example of this approach is the theory of constructed emotion (Barrett, 2017). According to the theory, emotions consist of complex and highly variable patterns of sensory inputs, interoceptive sensations, facial movements, and so on; these patterns do not arise from the activity of dedicated mechanisms, but instead get categorized as instances of a certain emotion through the incessant concept-forming activity of the brain. One implication of this view is that emotions should not be associated with specific neural signatures. In reviewing the empirical data in support of the theory, Barrett (2017) wrote:
“Ironically, perhaps the strongest evidence to date for the theory comes from studies that use pattern classification to distinguish categories of emotion. Several recent articles taking this approach have reported success in differentiating one emotion category from another—a finding that is routinely construed as providing the long awaited support for the classical view (Kassam et al., 2013; Kragel and LaBar, 2015; Saarimäki et al., 2016). However, patterns that distinguish among the categories in one study do not replicate in the other studies” (p. 15).
The deeper irony of this passage is that the author’s interpretation of the literature is a conspicuous example of the prediction-explanation fallacy. The regions/voxels identified as most predictive can be expected to vary from one study to the next, even if the underlying patterns of brain activity are stable and consistent. (Note that the variability is going to be magnified if the studies are based on small samples, as in this case.) Discordances between prediction-optimized models are par for the course; it is a mistake to conclude that a phenomenon lacks robustness just because different predictive models fail to agree with one another. Of course, this does not automatically count as a vindication of classical theories, or a falsification of Barrett’s theory of constructed emotions (which might well be the better model of how emotions work); but it does challenge the notion that the theory receives strong support from the sheer variability of brain imaging results.
Mitigating Factors and Strategies
Throughout this paper I have emphasized the differences and tradeoffs between prediction and explanation, but it is important to stress once again that these goals are not always or necessarily in tension.12 The biases introduced to maximize prediction accuracy can also improve explanatory accuracy if they happen to match the structure of the phenomenon under study. For example, algorithms that make sparsity assumptions can generate accurate explanations when the data-generating process is actually sparse (e.g., Epskamp et al., 2017). Thus, using a prediction-optimized model for explanatory purposes can be justified if there are theoretical and/or empirical reasons to believe that the assumptions of the model are closely matched to the structure of the underlying phenomenon. Indeed, strategically introducing biases of this kind can also improve the explanatory accuracy of models that are not intended or used for prediction at all (see also Footnote 5). From a complementary angle, it may be possible to gain insight into the structure of a phenomenon precisely by comparing the performance of alternative models that incorporate different assumptions and biases (e.g., sparsity vs. density; see Yarkoni & Westfall, 2017). Needless to say, this kind of comparison must be approached with great care to avoid the risk of other interpretive fallacies. One should also consider that regularization has a stronger impact when sample size is small for the number of model parameters; thus, some of the tradeoffs I discussed here tend to become less severe when working with large datasets.
As I noted earlier, one strategy that can be used to avoid the prediction-explanation fallacy is to fit different types of models to the data, some optimized for prediction and others for accurate explanation. If done with care, comparisons between different types of models can be illuminating, and may even act as springboards for new and better theories of a phenomenon (Hofman et al., 2021). Another way of lessening the problem is to avoid focusing on a single best-performing model, and instead capitalize on the Rashomon effect by training and examining a set of well-performing models, each with somewhat different explanatory biases (see e.g., Sanchis-Segura et al., 2022). In this regard, ML researchers have begun to develop specialized methods to systematically explore model variability across the so-called “Rashomon set” for a predictive task, that is, the full set of (approximately) equally accurate models of a given class. For instance, Fisher et al. (2019) proposed to measure the range of importance assumed by each variable across models (model class reliance). Another recent example is the work on variable importance clouds (Dong & Rudin, 2020), a visualization technique that maps the importance of each variable across the models in the set. Variable importance clouds go beyond aggregate estimates of importance and can reveal the existence of explanatory tradeoffs between variables, such that when one variable has high importance in a model, the other tends to have low importance (and vice versa). For a discussion of some conceptual complications in the analysis of variable importance measures, see Watson (2022).
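The following sketch conveys the spirit of these approaches in a very rough way: it fits a handful of roughly equally accurate models to the same (simulated, hypothetical) data and reports, for each variable, the range of its permutation importance across the models. This is only a crude analogue of the formal model class reliance statistic of Fisher et al. (2019), which is defined over the full Rashomon set of a model class rather than a handful of fitted models.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical data; the three models below stand in for a (very small) set of
# roughly equally accurate models
X, y = make_classification(n_samples=800, n_features=12, n_informative=6,
                           n_redundant=4, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

models = [
    LogisticRegression(max_iter=5000),
    RandomForestClassifier(n_estimators=300, random_state=1),
    GradientBoostingClassifier(random_state=1),
]

all_importances = []
for model in models:
    model.fit(X_tr, y_tr)
    print(f"{type(model).__name__}: test accuracy = {model.score(X_te, y_te):.2f}")
    # Model-agnostic permutation importance computed on held-out data
    imp = permutation_importance(model, X_te, y_te, n_repeats=20,
                                 random_state=1).importances_mean
    all_importances.append(imp)

# For each variable, the spread of its importance across the models:
# a crude analogue of exploring importance over a Rashomon set
imp_matrix = np.vstack(all_importances)
for j in range(X.shape[1]):
    lo, hi = imp_matrix[:, j].min(), imp_matrix[:, j].max()
    print(f"feature {j}: importance ranges from {lo:.3f} to {hi:.3f} across models")
```

In practice, the methods cited above explore far larger model sets and come with formal definitions and guarantees; the sketch is meant only to convey the underlying logic.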
Crucially, the mitigating strategies discussed above rely on statistical indices that are blind to the causal structure of the data. But as I noted earlier, a variable can be singled out as an “important” predictor even if it plays a spurious role in the true causal explanation of the phenomenon, or possibly just because it has been measured with less error than other, more causally meaningful variables. The only antidote to these threats to accurate explanation is explicit causal reasoning, which can be aided by the rapidly expanding toolkit of formal causal modeling (see Pearl et al., 2016; Pearl & Mackenzie, 2018; Rohrer, 2018; Wiedermann & Von Eye, 2016; see also Zhao & Hastie, 2021 for some considerations about the causal interpretation of black-box ML models).
Conclusion
Scientific applications of ML techniques can be extremely powerful; they also raise new problems and complications, both in their use and in the interpretation of their results. One of these problems—which seems to be particularly widespread—is the uncritical use of prediction-optimized models for explanatory purposes. Here I tried to pinpoint this fallacy, explain it in simple terms, and give it a convenient and descriptive name. The prediction-explanation fallacy can take a number of related forms; it can range from mild, ambiguous cases to glaring misinterpretations of the information provided by predictive models. The solution is not to prescribe that one should never use predictive models for explanation; that would be just a different sort of fallacy. Instead, researchers should explicitly address the tension between explanation and prediction in their analyses, consider potential mitigating factors, and—when feasible—use appropriate strategies to limit or circumvent the problem.
Raising awareness about the prediction-explanation fallacy will become especially critical as more psychologists and neuroscientists begin to follow Yarkoni and Westfall’s (2017) exhortation to put prediction at the center of their work. As demonstrated by the examples I reviewed here, it is easy to start with a purely predictive question but then unwittingly slip back into a “default” explanatory mode, without clearly realizing the implications and potential pitfalls. I hope this paper will contribute to improving the applied use of ML by helping researchers run more transparent, informative analyses and avoid drawing misleading conclusions from the data.