Tutorial

Bayesian Tests of Two Proportions: A Tutorial With R and JASP

Tabea Hoffmann¹,*, Abe Hofman², Eric-Jan Wagenmakers²

[1] Faculty of Economics and Business & Faculty of Spatial Sciences, University of Groningen, Groningen, The Netherlands. [2] Psychological Methods Group, University of Amsterdam, Amsterdam, The Netherlands.

Methodology, 2022, Vol. 18(4), 239–277, https://doi.org/10.5964/meth.9263

Received: 2022-04-04. Accepted: 2022-09-05. Published (VoR): 2022-12-22.

Handling Editor: Isabel Benítez, University of Granada, Granada, Spain

*Corresponding author at: Faculty of Economics and Business & Faculty of Spatial Sciences, Nettelbosje 2, University of Groningen, 9747 AE Groningen, The Netherlands. E-mail: t.hoffmann@rug.nl

This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

The need for a comparison between two proportions (sometimes called an A/B test) often arises in business, psychology, and the analysis of clinical trial data. Here we discuss two Bayesian A/B tests that allow users to monitor the uncertainty about a difference in two proportions as data accumulate over time. We emphasize the advantage of assigning a dependent prior distribution to the proportions (i.e., assigning a prior to the log odds ratio). This dependent-prior approach has been implemented in the open-source statistical software programs R and JASP. Several examples demonstrate how JASP can be used to apply this Bayesian test and interpret the results.

Keywords: Bayes factor, Bayesian estimation, contingency tables, log odds ratio

The comparison of two sample proportions is also known as an A/B test (Little, 1989). The statistical framework requires that there are two groups and that the data for each participant are dichotomous, e.g., 'correct–incorrect', 'infected–not infected', or 'watched the commercial–did not watch the commercial'.

The A/B test for proportions (henceforth 'the A/B test') is standard operating procedure for the analysis of clinical trial data: study participants are randomly allocated to one of two experimental conditions (which are often called group A and group B). One of the conditions is usually the control condition (e.g., a placebo), while the other condition introduces an intervention (e.g., a new pharmacological drug). In each condition, participants are evaluated on a dichotomous measure such as 'dead–alive', 'side effect–no side effect', etc. The goal of the experiment is to examine the treatment success of the intervention. Because of its general nature, the A/B test is also common in fields such as biology and psychology.

Another field that has recently adopted A/B testing—for so-called 'conversion rate optimization'—is online marketing. A conversion rate optimization experiment proceeds analogously to the classical experiment: two versions of the same website are shown to different selections of website visitors and the number of visitors who take a desired action (e.g., clicking on a specific button) is monitored.

Whenever the A/B test is applied, practitioners eventually wish to know whether and to what extent the experimental condition has a higher success rate than the control condition. This judgment requires that the observed sample difference in proportions is translated to the population—that is, the judgment requires statistical inference.

In general, practitioners can choose between the frequentist and the Bayesian frameworks for their statistical analysis. We subscribe to the three general desiderata for inference in the A/B test as outlined in Gronau, Raj K. N., and Wagenmakers (2021): ideally, (1) evidence can be obtained in favor of the null hypothesis; (2) evidence can be monitored continually, as the data accumulate; and (3) expert knowledge can be taken into account. Below we briefly explain why these desiderata are incompatible with the framework of frequentist statistics; next we turn to the Bayesian framework and examine two different Bayesian instantiations of the A/B test, which we then apply to two examples. This article is a tutorial, and consequently we will emphasize the assumptions, interpretations, and practical application of the test. A more advanced discussion of the software can be found in Gronau et al. (2021), and an associated statistical paradox is presented in Dablander, Huth, Gronau, Etz, and Wagenmakers (2022).

Frequentist Statistics

Practitioners predominantly use p-value null-hypothesis significance testing (NHST) to analyze A/B test data. However, the standard NHST approach does not satisfy the three desiderata mentioned above. Firstly, standard NHST results cannot distinguish between absence of evidence and evidence of absence (Keysers, Gazzola, & Wagenmakers, 2020; Robinson, 2019). Evidence of absence means that the data support the hypothesis that there is no effect (i.e., the two conditions do not differ); absence of evidence, however, means that the data are inconclusive (Altman & Bland, 1995). Secondly, in standard NHST the data cannot be tested sequentially without necessitating a correction for multiple comparisons that depends on the sampling plan (see for instance Berger & Wolpert, 1988; Wagenmakers, 2007; Wagenmakers et al., 2018). In clinical trials especially, but also in online marketing, it is efficient to act as soon as the data provide evidence that is sufficiently compelling. To achieve such efficiency, many A/B test practitioners repeatedly peek at interim results and stop data collection as soon as the p-value is smaller than some predefined α-level (Goodson, 2014). However, this practice inflates the Type I error rate and hence invalidates an NHST analysis (Jennison & Turnbull, 1990; Wagenmakers, 2007). Thirdly, standard NHST does not allow users to incorporate detailed expert knowledge. For example, among conversion rate optimization professionals it is widely known that online advertising campaigns often yield minuscule increases in conversion rates (cf. Johnson, Lewis, & Nubbemeyer, 2017; Patel, 2018). Such knowledge may affect NHST planning (i.e., knowledge that the effect is minuscule would necessitate the use of very large sample sizes), but it is unclear how it would affect inference.¹ As we will see below, in the Bayesian framework it is conceptually straightforward to enrich statistical models with expert background knowledge, thereby resulting in more informed statistical analyses (Lindley, 1993).
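
The inflation of the Type I error rate under repeated peeking can be illustrated with a small simulation. The following base-R sketch is not taken from the article: the number of simulated experiments, the peeking schedule (every 20 observations per group, up to 200), and the pooled two-sided z-test are all assumptions chosen for illustration. Both groups share the same success probability, so every "significant" result is a false alarm.

```r
# Simulate optional stopping when the null hypothesis is true (both groups have p = 0.5).
set.seed(123)

n_sims <- 2000                     # number of simulated experiments (assumed)
looks  <- seq(20, 200, by = 20)    # per-group sample sizes at which we peek (assumed)
alpha  <- 0.05

# Two-sided pooled z-test for two proportions with n observations per group.
two_prop_p <- function(s_a, s_b, n) {
  p_pool <- (s_a + s_b) / (2 * n)
  se <- sqrt(p_pool * (1 - p_pool) * 2 / n)
  if (se == 0) return(1)           # no variation in the data: nothing to reject
  z <- (s_a / n - s_b / n) / se
  2 * pnorm(-abs(z))
}

one_experiment <- function() {
  a <- rbinom(max(looks), 1, 0.5)
  b <- rbinom(max(looks), 1, 0.5)
  p_at_look <- sapply(looks, function(n) two_prop_p(sum(a[1:n]), sum(b[1:n]), n))
  c(peeking = any(p_at_look < alpha),            # reject at any interim look
    fixed_n = p_at_look[length(looks)] < alpha)  # test once, at the planned end
}

# Proportion of experiments that reject the (true) null under each strategy.
rates <- rowMeans(replicate(n_sims, one_experiment()))
print(rates)
```

Under these assumptions the rejection rate for the single fixed-n test should stay near the nominal α, whereas the rate for the peeking strategy should come out clearly higher.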

It should be acknowledged, however, that non-standard (i.e., less popular) forms of frequentist analyses exist that alleviate some of the concerns listed above. For instance, sequential inference can be carried out by the Sequential Probability Ratio Test (Schnuerch & Erdfelder, 2020; Wald, 1945) or by Safe Testing (e.g., Grünwald, de Heide, & Koolen, 2021). In addition, it has been argued that evidence of absence can be obtained by means of an equivalence test, in which the null hypothesis is defined as an effect size that falls outside of a region of practical interest (e.g., King, 2011; Tango, 1998). An in-depth discussion of the pros and cons of frequentist inference is beyond the scope of this article.

Bayesian Statistics

The limitations of standard frequentist statistics can be overcome by adopting a Bayesian data analysis approach (e.g., Deng, 2015; Kamalbasha & Eugster, 2021; Stucchio, 2015). In Bayesian statistics, probability expresses a degree of knowledge or reasonable belief (Jeffreys, 1961) and in principle Bayesian statistics fulfills all three desiderata listed above (e.g., Wagenmakers et al., 2018). In the next sections we introduce two approaches to Bayesian A/B testing. The two approaches make different assumptions, ask different questions, and therefore provide different answers (cf. Dablander et al., 2022).

The ‘Independent Beta Estimation (IBE) Approach’

Let n_A denote the total number of observations and y_A denote the number of successes for Group A. Let n_B denote the total number of observations and y_B denote the number of successes for Group B. The commonly used Bayesian A/B testing model is specified as follows:

\begin{array}{l} y_{A} \sim \text{Binomial}(n_{A}, θ_{A}) \\ y_{B} \sim \text{Binomial}(n_{B}, θ_{B}) \end{array}

This model assumes that y_A and y_B follow independent binomial distributions with success probabilities θ_A and θ_B. These success probabilities are assigned independent beta(α, β) distributions that encode the relative prior plausibility of the values for θ_A and θ_B. In a beta distribution, the α values can be interpreted as counts of hypothetical ‘prior successes’ and the β values can be interpreted as counts of hypothetical ‘prior failures’ (Lee & Wagenmakers, 2013):

\begin{array}{l} θ_{A} \sim \text{Beta}(α_{A}, β_{A}) \\ θ_{B} \sim \text{Beta}(α_{B}, β_{B}) \end{array}

Data from the A/B testing experiment update the two independent prior distributions to two independent posterior distributions as dictated by Bayes’ rule:

\begin{array}{l} p (θ_{A} | y_{A}, n_{A}) = \frac{p (θ_{A}) \times p (y_{A}, n_{A} | θ_{A})}{p (y_{A}, n_{A})} \\ p (θ_{B} | y_{B}, n_{B}) = \frac{p (θ_{B}) \times p (y_{B}, n_{B} | θ_{B})}{p (y_{B}, n_{B})} \end{array}

where p(θ_A) and p(θ_B) are the prior distributions and p(y_A,n_A | θ_A) and p(y_B,n_B | θ_B) are the likelihoods of the data given the respective parameters. Hence, the reallocation of probability from prior to posterior is brought about by the data: the probability increases for parameter values that predict the data well and decreases for parameter values that predict the data poorly (Kruschke, 2013; van Doorn, Matzke, & Wagenmakers, 2020; Wagenmakers, Morey, & Lee, 2016). Note that whenever a beta prior is used and the observed data are binomially distributed, the resulting posterior distribution is also a beta distribution. Specifically, if the data consist of s successes and f failures, the resulting posterior beta distribution equals beta(α + s, β + f) (Gelman et al., 2013; van Doorn et al., 2020).² Ultimately, practitioners are most often interested in the difference δ = θ_A − θ_B between the success rates of the two experimental groups, as this difference indicates whether the experimental condition shows the desired effect.
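
The conjugacy result can be checked numerically for a single group. The sketch below uses hypothetical counts (s = 7 successes and f = 3 failures, not from the article) and a uniform beta(1, 1) prior. It applies Bayes' rule on a grid of θ values and compares the outcome with the closed-form beta(α + s, β + f) density. The grid size and variable names are illustrative assumptions.

```r
# Hypothetical data for one group (illustrative only, not from the article).
alpha_prior <- 1
beta_prior  <- 1
s <- 7   # successes
f <- 3   # failures

# Grid of midpoints on (0, 1) for a simple midpoint-rule integration.
n_grid <- 1000
width  <- 1 / n_grid
theta  <- (seq_len(n_grid) - 0.5) * width

prior      <- dbeta(theta, alpha_prior, beta_prior)
likelihood <- dbinom(s, size = s + f, prob = theta)

# Bayes' rule: posterior = prior * likelihood / marginal likelihood.
marginal  <- sum(prior * likelihood) * width
post_grid <- prior * likelihood / marginal

# Conjugate shortcut: the posterior is beta(alpha + s, beta + f).
post_exact <- dbeta(theta, alpha_prior + s, beta_prior + f)

# The two densities should agree up to numerical integration error.
max_abs_diff <- max(abs(post_grid - post_exact))
print(max_abs_diff)
```

The grid version makes the reallocation of probability explicit: values of θ with a high likelihood gain density relative to the prior, and values with a low likelihood lose it.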

R Implementation of the IBE Approach: The `bayesAB` Package

The IBE approach is implemented, for instance, in the bayesAB (version 1.1.3, Portman, 2017) package in R (version 4.2.1, R Core Team, 2020). Consider the following fictitious example from ethology, inspired by the classic work of von Frisch (1914). A researcher wishes to test whether honey bees have color vision by comparing the behavior of two groups of bees. The experiment involves a training and a testing phase. In the training phase, the bees in the experimental condition are presented with a blue and a green disc. Only the blue disc is covered with a sugar solution that bees crave. The control group receives no training. In the testing phase, the sugar solution is removed from the blue disc, and the behavior of both groups is observed. If the bees in the experimental condition have learned that only the blue disc contains the appetising sugar solution, and if they can discriminate between blue and green, they should preferentially explore the blue instead of the green disc during the testing phase. The researcher finds that in 65 out of 100 times, the bees in the experimental group continued to approach the blue disc after the sugar solution was removed. The bees that were not trained approached the blue disc 50 out of 100 times. In the remainder of this section, we will refer to the bees in the control condition as group A and to the bees in the experimental condition as group B. The R file for this fictitious example can be found in the Supplementary Materials.

Before setting up this A/B test and collecting the data, the prior distribution has to be specified so that it represents the relative plausibility of the parameter values. For the present example, the researcher specifies two uninformative (uniform) beta(1,1) priors. After running the A/B test procedure, the priors are updated with the obtained data. With bayesAB the calculation of the posterior distributions is done by feeding both the priors and the data to the bayesTest function:

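R> library(bayesAB)
R> # Read the trial-level outcomes (semicolon-separated file)
R> bees1 <- read.csv2("bees_data1.csv")
R> # First argument: group A data; second argument: group B data
R> AB1 <- bayesTest(bees1$y1, bees1$y2,
+                   priors = c('alpha' = 1, 'beta' = 1),
+                   n_samples = 1e5, distribution = 'bernoulli')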
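
Readers without access to the supplementary file can build a stand-in data set from the reported counts. The sketch below assumes that the file holds one row per trial with 0/1 outcomes in columns y1 (group A) and y2 (group B); this layout is an assumption, and the order of the trials is arbitrary here, so results that depend on the order in which data arrive (e.g., a sequential analysis) will not match the original file. The names bees_sketch, bees_check, and AB_sketch are hypothetical.

```r
# Assumed layout: one row per trial, 1 = approached the blue disc, 0 = did not.
bees_sketch <- data.frame(
  y1 = rep(c(1, 0), times = c(50, 50)),  # group A (control): 50 of 100
  y2 = rep(c(1, 0), times = c(65, 35))   # group B (trained): 65 of 100
)

# Round-trip through a semicolon-separated file, the format read.csv2() expects.
csv_path <- tempfile(fileext = ".csv")
write.csv2(bees_sketch, csv_path, row.names = FALSE)
bees_check <- read.csv2(csv_path)

# Number of successes per group.
print(colSums(bees_check))

# Run the package call only if bayesAB happens to be installed.
if (requireNamespace("bayesAB", quietly = TRUE)) {
  AB_sketch <- bayesAB::bayesTest(bees_check$y1, bees_check$y2,
                                  priors = c("alpha" = 1, "beta" = 1),
                                  n_samples = 1e5, distribution = "bernoulli")
}
```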

A more detailed explanation of the function and its arguments can be obtained by typing ?bayesTest into the R console. The results can be obtained and visualized by executing:

R> summary(AB1)
R> plot(AB1)

Figure 1 shows the two independent posterior distributions that plot(AB1) returns. To plot these posterior distributions, bayesTest makes use of the rbeta function that draws random numbers from a given beta distribution. To obtain each posterior distribution the package first exploits conjugacy: the number of successes s is added to the α value of either version’s prior distribution and the number of failures f is added to the respective β value (e.g., Kruschke, 2015; Kurt, 2019). Thus, the posterior distribution for θ_A is beta(α_A + s_A, β_A + f_A) and that for θ_B is beta(α_B + s_B, β_B + f_B). The rbeta function draws random samples from each posterior distribution and the density of these values is shown in Figure 1.³ We can see that group B’s posterior distribution for the success probability assigns more mass to higher values of θ. This suggests that the success probability of the trained bees is higher, which in turn implies that bees have color vision.
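
The conjugate-update-then-sample idea can be reproduced in a few lines of base R. This is a sketch of the same logic, not the package's actual code. It uses the bee counts and the uniform beta(1, 1) priors from the text; the seed, the variable names, and the choice to compute the difference as θ_B − θ_A (trained minus control) are assumptions made for illustration.

```r
set.seed(2022)
n_samples <- 1e5

# Fictitious bee data from the text.
s_A <- 50; f_A <- 100 - s_A   # group A (control): 50 of 100
s_B <- 65; f_B <- 100 - s_B   # group B (trained): 65 of 100

# Uniform beta(1, 1) priors updated by conjugacy, then sampled with rbeta().
theta_A <- rbeta(n_samples, 1 + s_A, 1 + f_A)
theta_B <- rbeta(n_samples, 1 + s_B, 1 + f_B)

# Posterior samples of the difference between the two success probabilities.
delta <- theta_B - theta_A

# Monte Carlo estimate of the posterior probability that theta_B exceeds theta_A.
prob_B_greater <- mean(delta > 0)

# Central 95% credible interval for the difference.
ci_delta <- quantile(delta, c(0.025, 0.975))

print(prob_B_greater)
print(ci_delta)
```

Because the two posteriors are independent, pairing the draws element by element yields samples from the posterior of the difference, which is why a simple subtraction suffices.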

Table 1

            Game Type
Coins       Non-Crown   Crown   Total
Yes         2272        906     3178
No          2596        625     3221
Total       4868        1531    6399

Table 2

            Click on Button
Condition   Yes         No      Total
A           1131        8869    10000
B           1275        8725    10000
Total       2406        17594   20000

Figure 1

Independent Posterior Beta Distributions of the Success Probabilities for Group A and B

Figure 2

Histogram of the Conversion Rate Uplift From Version A (i.e., 50/100) to Version B (i.e., 65/100) for the Fictitious Bee Data Set

Figure 3

Posterior Distributions of the Difference δ = θ_B − θ_A for the Fictitious Bee Data (i.e., A = 50/100 Versus B = 65/100)

Figure 4

Sequential Analysis of the Difference Between the Success Probabilities (i.e., δ = θ_B − θ_A) for the Fictitious Bee Data (i.e., A = 50/100 versus B = 65/100)

Assumptions of the IBE Approach

The Logit Transformation Testing (LTT) Approach

Implementation of the LTT Approach in R and JASP

Figure 5

Bayes Factor Robustness Plot for the Comparison Between H₀ and H₊ as Applied to the Fictitious Bee Data (i.e., A = 50/100 Versus B = 65/100)

Figure 6

The Flow of Posterior Probability for H₀ and H₊ as a Function of the Accumulating Number of Observations for the Fictitious Bee Data (i.e., A = 50/100 vs. B = 65/100)

Figure 7

Prior and Posterior Distribution of the Log Odds Ratio Under H₁ as Applied to the Fictitious Bee Data (i.e., A = 50/100 vs. B = 65/100)

Figure 8

JASP (Version 0.16.3.0) Screenshot of the Summary Statistics Implementation of the abtest R Package

Example I: The Rekentuin

The Rekentuin A/B Experiment

Figure 9

Screenshots from the Rekentuin Web Environment

Method

Preregistration

Data Preprocessing

Results

Descriptives

Table 1

Rekentuin A/B Test: The IBE Approach

Figure 10

Independent Posterior Beta Distributions of the Success Probabilities of Playing a Non-Crown Game

Figure 11

Histogram of the Conversion Rate Uplift from Version A (i.e., 2272/3178) to Version B (i.e., 2596/3221) for the Rekentuin Data

Figure 12

Posterior Distribution of the Difference δ = θ_B^{Non-Crown} − θ_A^{Non-Crown} for the Proportion of Non-Crown Games Between the Two Rekentuin Website Versions

Figure 13

Sequential Analysis of the Difference Between the Success Probabilities (i.e., θ_B^{Non-Crown} − θ_A^{Non-Crown}) of the two Rekentuin Versions

Rekentuin A/B Test: The LTT Approach

Figure 14

Bayes Factor Robustness Plot for the Rekentuin Data

Figure 15

The Flow of Posterior Probability for H₀ and H₊ as a Function of the Number of Observations Across Both Rekentuin Versions

Figure 16

Prior and Posterior Distribution of the Log Odds Ratio Under H₁ for the Rekentuin Data Set

Example II: The Fictional Webshop

Table 2

The IBE Approach

Figure 17

Two Independent Posterior Distributions and Conversion Rate Uplift

Figure 18

Posterior Distribution of the Difference δ = θ_B − θ_A for the Click-Through Proportion Between the two Fictitious Website Versions

The LTT Approach

Figure 19

Sequential Analysis of the Difference Between the Click-Through Probabilities (i.e., θ_B − θ_A) of the two Fictitious Webshop Versions

Figure 20

Three Different Prior Distributions for the Analysis of the Fictitious Webshop Data

Figure 21

Bayes Factor Robustness Plot for the Fictitious Webshop Data

Figure 22

Flow of Posterior Probability for H₀ and H₊ as a Function of the Number of Observations Across Both Fictitious Website Versions.

Figure 23

Prior and Posterior Distribution of the Log Odds Ratio Under H₁ for the Fictitious Webshop Data Set

Concluding Comments

Notes

Funding

Acknowledgments

Competing Interests

Data Availability

Supplementary Materials

Index of Supplementary Materials

References

Appendix

Key Statistical Results for the IBE Approach

The Probability of θ₁ > θ₂

The Distribution of δ = θ₁ − θ₂