
The need for a comparison between two proportions (sometimes called an A/B test) often arises in business, psychology, and the analysis of clinical trial data. Here we discuss two Bayesian A/B tests that allow users to monitor the uncertainty about a difference in two proportions as data accumulate over time. We emphasize the advantage of assigning a dependent prior distribution to the proportions (i.e., assigning a prior to the log odds ratio). This dependent-prior approach has been implemented in the open-source statistical software programs R and JASP. Several examples demonstrate how JASP can be used to apply this Bayesian test and interpret the results.

Practitioners predominantly use classical frequentist methods to analyze A/B test data.^{1}

For example, with the data in hand one may find that

It should be acknowledged, however, that non-standard (i.e., less popular) forms of frequentist analyses exist that alleviate some of the concerns listed above. For instance, sequential inference can be carried out by the Sequential Probability Ratio Test (

The limitations of standard frequentist statistics can be overcome by adopting a Bayesian data analysis approach (e.g.,

Let y_{A} denote the number of successes out of n_{A} trials in condition A, and let y_{B} denote the number of successes out of n_{B} trials in condition B.

This model assumes that y_{A} and y_{B} follow binomial distributions governed by the success probabilities θ_{A} and θ_{B}. These success probabilities are assigned independent beta(α, β) distributions that encode the relative prior plausibility of the values for θ_{A} and θ_{B}. In a beta distribution, the α values can be interpreted as counts of hypothetical ‘prior successes’ and the β values can be interpreted as counts of hypothetical ‘prior failures’.

Data from the A/B testing experiment update the two independent prior distributions to two independent posterior distributions as dictated by Bayes’ rule:
p(θ_{A} | y_{A}) ∝ p(y_{A} | θ_{A}) p(θ_{A}) and p(θ_{B} | y_{B}) ∝ p(y_{B} | θ_{B}) p(θ_{B}), where p(θ_{A}) and p(θ_{B}) are the prior distributions and p(y_{A} | θ_{A}) and p(y_{B} | θ_{B}) are the likelihoods of the data given the respective parameters. Hence, the reallocation of probability from prior to posterior is brought about by the data: the probability increases for parameter values that predict the data well and decreases for parameter values that predict the data poorly.^{2}

When the prior and the posterior belong to the same family of distributions they are said to be conjugate.
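Because the beta prior is conjugate to the binomial likelihood, updating amounts to adding the observed successes and failures to the prior counts. A minimal Python sketch of this conjugate arithmetic (an illustration, not the packages used below), using the fictitious bee data analyzed later (Version A: 50 successes out of 100, Version B: 65 out of 100):

```python
def update_beta(alpha, beta, successes, failures):
    """Conjugate update: a beta prior combined with binomial data
    yields a beta posterior with incremented counts."""
    return alpha + successes, beta + failures

# Uniform beta(1, 1) priors for both conditions.
a_A, b_A = update_beta(1, 1, successes=50, failures=50)  # Version A: 50/100
a_B, b_B = update_beta(1, 1, successes=65, failures=35)  # Version B: 65/100

# Posterior means: alpha / (alpha + beta).
mean_A = a_A / (a_A + b_A)
mean_B = a_B / (a_B + b_B)
```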

Of key interest is the difference δ between the success rates θ_{A} and θ_{B} of the two experimental groups, as this difference indicates whether the experimental condition shows the desired effect.

The IBE approach is implemented, for instance, in the bayesAB R package.

Before setting up this A/B test and collecting the data, the prior distribution has to be specified so that it represents the relative plausibility of the parameter values. For the present example, the researcher specifies two uninformative (uniform) beta(1,1) priors. After running the A/B test procedure, the priors are updated with the obtained data. With the bayesAB package, the analysis can be set up as follows:

```
R> library(bayesAB)
R> bees1 <- read.csv2("bees_data1.csv")
R> AB1 <- bayesTest(bees1$y1, bees1$y2,
+ priors = c('alpha' = 1, 'beta' = 1),
+ n_samples = 1e5, distribution = 'bernoulli')
```

A more detailed explanation of the function and its arguments can be obtained by typing ?bayesTest in the R console. The results can then be summarized and visualized:

```
R> summary(AB1)
R> plot(AB1)
```

The posterior distribution for θ_{A} is beta(α_{A} + y_{A}, β_{A} + n_{A} − y_{A}) and the posterior distribution for θ_{B} is beta(α_{B} + y_{B}, β_{B} + n_{B} − y_{B}).^{3}

The posterior distributions are available analytically, so at this point the rbeta function is not needed; it will become relevant once we start to investigate the posterior distribution for the difference between θ_{A} and θ_{B}.

The bayesAB package also reports the ‘uplift’: the difference between the two success probabilities expressed as a proportion of θ_{A}. The advantage of expressing the difference as a proportion of θ_{A} is that a change from 1% to 2% (i.e., a doubling of the conversion rate) is seen to be much more impressive than a change from 50% to 51%. The associated disadvantage is that a small change can appear more impressive than it really is. The posterior distribution for the conversion rate uplift is computed from the random samples obtained for the two beta posteriors.
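The same Monte Carlo computation can be sketched in a few lines of Python (an illustration, not the bayesAB implementation), again with uniform priors and the fictitious bee counts A = 50/100 and B = 65/100:

```python
import random

random.seed(1)
N = 100_000

# Posterior parameters: beta(1 + 50, 1 + 50) for A, beta(1 + 65, 1 + 35) for B.
samples_A = [random.betavariate(51, 51) for _ in range(N)]
samples_B = [random.betavariate(66, 36) for _ in range(N)]

diff = [b - a for a, b in zip(samples_A, samples_B)]
uplift = [(b - a) / a for a, b in zip(samples_A, samples_B)]

# Monte Carlo estimate of P(theta_B > theta_A).
prob_B_better = sum(d > 0 for d in diff) / N
```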

The posterior probability that θ_{B} > θ_{A} can also be obtained analytically; here P(θ_{B} > θ_{A} | data) = 0.984. In fact, the entire posterior distribution for the difference between the two independent beta distributions is available analytically, although the analytic solution can break down for large parameter values.^{4}

The reason for this failure is numerical overflow in the evaluation of Appell’s first hypergeometric function.

In this case, one can instead employ a normal approximation, the result of which is shown in the right panel of the corresponding figure.^{5}

The appendix contains the formulas for both the analytic solution and the normal approximation.

One advantage of the Bayesian approach is that the data can also be added to the analysis in a sequential manner. This means that the evidence can be assessed continually as the data arrive and the analysis can be stopped as soon as the evidence is judged to be compelling. The posterior means of θ_{A} and θ_{B}, as well as the 95% highest density interval (HDI) of the difference, can be monitored sequentially. The HDI narrows with increasing sample size, indicating that the range of likely values for δ gradually becomes smaller. After some initial fluctuation, the posterior mean difference between θ_{A} and θ_{B} settles between 0.1 and 0.2. The R code for the sequential computation can be found in the OSF repository.
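A Python sketch of this sequential monitoring (simulated data under hypothetical true rates of 0.50 and 0.65; the interval is a central 95% interval from the normal approximation, used here as a simple stand-in for the HDI):

```python
import math
import random

random.seed(2)

def beta_mean_var(a, b):
    """Mean and variance of a beta(a, b) distribution."""
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

# Hypothetical true rates for the simulated stream (not the paper's data).
true_A, true_B = 0.50, 0.65
a_A = b_A = a_B = b_B = 1  # uniform beta(1, 1) priors

widths = []  # width of a central 95% interval for the difference
for n in range(1, 1001):
    if random.random() < true_A:
        a_A += 1
    else:
        b_A += 1
    if random.random() < true_B:
        a_B += 1
    else:
        b_B += 1
    if n % 250 == 0:
        m_A, v_A = beta_mean_var(a_A, b_A)
        m_B, v_B = beta_mean_var(a_B, b_B)
        widths.append(2 * 1.96 * math.sqrt(v_A + v_B))
```

As expected, the recorded interval widths shrink as observations accumulate, mirroring the narrowing HDI described above.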

In sum, the IBE approach allows practitioners to judge the size and direction of an effect, that is, the difference between the two success probabilities. It is important, however, to recognize the assumptions that come with this approach. In the next section, we will elaborate on these assumptions and their consequences.

The IBE approach makes two important assumptions. The first assumption is that the two success probabilities are independent: learning about the success rate of one experimental condition does not affect our knowledge about the success rate of the other condition (

Do English or Scots cattle have a higher proportion of cows infected with a certain virus? Suppose we were informed (before collecting any data) that the proportion of English cows infected was 0.8. With independent uniform priors we would now give P(θ_{1} > θ_{2}) a probability of 0.8 (… ) In very many cases this would not be appropriate. Often we will believe (for example) that if

The second assumption of the IBE approach is that an effect is always present; that is, training the bees to prefer a certain color may increase approach rates or decrease approach rates; it is never the case that the training is completely ineffective. This assumption follows from the fact that a continuous prior does not assign any probability to a specific point value such as δ = 0 (

An A/B test model that assigns prior mass to the null hypothesis of no effect was introduced by

As before, this model assumes that y_{A} and y_{B} follow binomial distributions with success probabilities θ_{A} and θ_{B}. However, the success probabilities are a function of two parameters, γ and ψ. Parameter γ indicates the grand mean of the log odds, while ψ denotes the distance between the two conditions (i.e., the log odds ratio).
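The exact link function is not spelled out here; one common choice (assumed in this Python sketch) places the two conditions symmetrically around the grand mean of the log odds:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def success_probs(gamma, psi):
    """Assumed parametrization: logit(theta_A) = gamma - psi / 2 and
    logit(theta_B) = gamma + psi / 2, so psi is the log odds ratio."""
    return logistic(gamma - psi / 2), logistic(gamma + psi / 2)

theta_A, theta_B = success_probs(gamma=0.0, psi=0.5)

# Recover the log odds ratio from the two success probabilities.
log_odds_ratio = math.log(theta_B / (1 - theta_B)) - math.log(theta_A / (1 - theta_A))
```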

While the choice of a prior for γ is relatively inconsequential, the choice of a prior for ψ is crucial for the comparison between the rival hypotheses.^{6}

Note that the overall prior distribution for ψ can be considered a mixture between a ‘spike’ at 0 coming from the null hypothesis and a continuous component coming from the alternative hypotheses.

We consider four hypotheses that may be of interest in practice:

H_{0}: the success probabilities θ_{A} and θ_{B} are identical.

H_{1}: the success probabilities θ_{A} and θ_{B} are not identical.

H_{+}: the success probability θ_{B} is larger than the success probability θ_{A}.

H_{−}: the success probability θ_{A} is larger than the success probability θ_{B}.

By comparing these hypotheses, practitioners may obtain answers to the following questions:

Is there a difference between the success probabilities, or are they the same? This requires a comparison between H_{1} and H_{0}.

Does group B have a higher success probability than group A, or are the probabilities the same? This requires a comparison between H_{+} and H_{0}.

Does group A have a higher success probability than group B, or are the probabilities the same? This requires a comparison between H_{−} and H_{0}.

Does group B have a higher success probability than group A, or does group A have a higher success probability than group B? This is the question that is also addressed by the IBE approach discussed earlier, and it requires a comparison between H_{+} and H_{−}.

To quantify the evidence that the observed data provide for and against the hypotheses we compare the models’ predictive performance.^{7}

We use the terms ‘model’ and ‘hypothesis’ interchangeably.

For two models, say H_{+} and H_{0}, the Bayes factor BF_{+0} indicates the extent to which the data are more likely under H_{+} than under H_{0}.

The evidence from the data is expressed in the Bayes factor, but to compare two hypotheses in their entirety, the prior odds need to be taken into account as well.

The prior odds quantify the plausibility of the hypotheses before seeing the data, while the posterior odds quantify the plausibility of the two hypotheses after taking the data into account (
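In numbers (the equal prior odds are an illustrative assumption; 4.7 is the Bayes factor BF_{+0} reported for the bee example below):

```python
# Posterior odds = Bayes factor x prior odds.
prior_odds = 1.0      # H+ and H0 considered equally plausible a priori
bayes_factor = 4.7    # BF+0 for the bee example
posterior_odds = bayes_factor * prior_odds

# Converting odds to a posterior probability for H+.
posterior_prob = posterior_odds / (1 + posterior_odds)
```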

To demonstrate the analyses with the LTT approach we can use the abtest R package.

It is recommended that a hypothesis is specified before setting up the A/B test (

For illustrative purposes we assume that in the present example there is little prior knowledge, which motivates the specification of an uninformed standard normal prior distribution:

```
R> library(abtest)
R> bees2 <- as.list(read.csv2("bees_data2.csv")[-1,-1])
R> prior_prob <- c(0, 0.5, 0, 0.5)
R> names(prior_prob) <- c("H1", "H+", "H-", "H0")
R> AB2 <- ab_test(data = bees2, prior_par = list(mu_psi = 0,
+ sigma_psi = 1, mu_beta = 0, sigma_beta = 1),
+ prior_prob = prior_prob)
```

As shown in the code above, the standard normal prior on ψ is specified by setting mu_psi = 0 and sigma_psi = 1. The resulting Bayes factor BF_{+0} equals 4.7, meaning that the data are approximately 5 times more likely under the alternative hypothesis H_{+} than under the null hypothesis H_{0}.

The robustness of this conclusion can be explored by changing the prior distribution on ψ (i.e., by varying the mean and standard deviation of the normal prior distribution) and observing the effect on the Bayes factor across a grid of values for μ_{ψ} and σ_{ψ}. The Bayes factor is highest for low σ_{ψ} values and μ_{ψ} ≈ 0.6. The heatmap shows that our conclusion regarding the evidence for H_{+} is robust:

```
R> plot_robustness(AB2, mu_range = c(0, 2), sigma_range = c(0.1, 1),
+ bftype = "BF+0")
```

The Bayes factor is highest for low σ_{ψ} values and μ_{ψ} ≈ 0.6. The conclusion that there is moderate evidence for H_{+} over H_{0} holds across a large range of values for μ_{ψ} and σ_{ψ}.

A sequential analysis tracks the evidence in chronological order.

`R> plot_sequential(AB2)`

Having collected evidence for the hypothesis that trained bees prefer the blue disc more than do untrained bees, one might then wish to quantify the size of this difference in preference. To do so, we switch from a testing framework to an estimation framework. For this purpose, we adopt the two-sided model H_{1} and inspect the posterior distribution of the log odds ratio:

`R> plot_posterior(AB2, what = 'logor')`

The dotted line in

The posterior distributions for the two success probabilities can be inspected separately using:

`R> plot_posterior(AB2, what = 'p1p2')`

The

The JASP implementation of the A/B test comparing H_{0} and H_{+}, as applied to the fictitious bee data (i.e., A = 50/100 versus B = 65/100). The left panel shows the input options and the right panel shows the associated analysis output.

Using the

The second method to activate the

To showcase the different approaches to Bayesian A/B testing we now apply the methodology to two example data sets.^{8}

For general recommendations on how to apply Bayesian procedures and interpret the results, consider

Rekentuin (Dutch for ‘math garden’) is a tutoring website where children can practice their arithmetic skills by playing adaptive online games. The Rekentuin website is visited by Dutch elementary school children between the ages of 4 and 12. During the testing interval from the 22nd of January 2019 to the 5th of February 2019, a total of 15,322 children were active on Rekentuin.

The left-hand panel of

In 2019, the developers of Rekentuin faced the challenge that many children would preferentially engage with the class of arithmetic problems that they had already mastered (e.g., addition)—a sensible strategy if the goal is to maximize the number of coins gained. To incentivize the children to practice other classes of arithmetic problems (e.g., subtraction) the developers implemented a ‘crown’ for the type of games that the children had already mastered (see

To induce the children to play other games, the Rekentuin developers constructed a less subtle manipulation: they removed the virtual reward (i.e., the coins) from the crown games. To test the effectiveness of this manipulation, the Rekentuin developers designed an A/B test. Half of the children continued playing on an unchanged website (Version A), whereas the other half could no longer earn coins for crown games (Version B). The children playing Version B were not notified of the change but had to discover the changes for themselves.

The question of interest is whether changing the incentive structure for crown games (i.e., removing the coins) had the desired effect. To address this question we analyzed the Rekentuin data set using the two Bayesian A/B testing approaches outlined earlier.

The data were collected by Abe Hofman and colleagues on the Rekentuin website in 2019. All intended analyses were applied to synthetic data and the associated analysis scripts were stored on a repository at the OSF. We did not inspect the data before the preregistration was finalized. All preregistration materials as well as the real data are available in the

Our analysis concerns the last game that each child played during the testing interval: was it a crown game or not? By examining only the last game we obtain a binary variable (required for the present A/B test) and also allow children the maximum opportunity to experience that crown games no longer yield coins.

We excluded children from the analyses according to two criteria. Firstly, we excluded 8573 children who did not play any crown game during the time of testing because they could not have experienced the experimental manipulation in Version B. Secondly, we excluded 350 children who only played one crown game and it was their last game, because for these children we cannot observe the potential influence of the manipulation on their playing behavior. In total, we therefore excluded 8923 children.

The Rekentuin data are summarized in

The columns classify the last game played by game type; the rows indicate whether coins could be earned for crown games (Yes = Version A, No = Version B).

| Coins | Non-Crown | Crown | Total |
|---|---|---|---|
| Yes | 2272 | 906 | 3178 |
| No | 2596 | 625 | 3221 |
| Total | 4868 | 1531 | 6399 |
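From the table, the quantities on which the two approaches operate can be computed directly; a Python sketch:

```python
import math

# Cell counts from the Rekentuin table (last game played).
non_crown_A, crown_A = 2272, 906   # Version A: coins kept
non_crown_B, crown_B = 2596, 625   # Version B: coins removed

# Observed proportions of non-crown last games.
p_A = non_crown_A / (non_crown_A + crown_A)
p_B = non_crown_B / (non_crown_B + crown_B)

# Sample log odds ratio, the effect size used by the LTT approach.
log_or = math.log(non_crown_B / crown_B) - math.log(non_crown_A / crown_A)
```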

As before, in the IBE approach we assigned two uninformed beta(1,1) distributions to the success probabilities of Versions A and B.^{9}

Researchers with access to pre-intervention data could instead consider using an informed prior distribution, although there is always a risk that the pre-intervention data differ from the post-intervention data on some unknown dimension.

Consistent with the intuitive impression from the data, the posterior distribution for Version B is located at higher values of θ^{non-crown} than that for Version A. This suggests that the success probability of the modified Rekentuin version is higher and that removing the coins from the crown games had a positive impact on the number of non-crown games played.

In addition to the posterior distributions, we obtain the posterior probability P(θ_{B} > θ_{A}) ≈ 1. The analytic calculation of the posterior distribution for the difference is numerically unstable for counts this large, so we rely on the normal approximation instead.

For the LTT approach, we compare the directional hypothesis H_{+} to the null hypothesis H_{0}, and assign ψ a normal prior with variance σ^{2} = 1 under the alternative hypothesis, as there is a range of parameter values that seem plausible.

The observed sample proportions of 0.806 for Version B and 0.715 for Version A suggest that the children in Version B played more non-crown games as compared to Version A. The Bayes factor BF_{+0} that assesses the evidence in favor of our hypothesis that the children in Version B played more non-crown games equals 7.944e+14. This means that the data are about 800 trillion times more likely to occur under the alternative hypothesis H_{+} than under the null hypothesis H_{0}. The robustness of this result can again be assessed by varying the mean μ_{ψ} and standard deviation σ_{ψ} of the normal prior distribution. From looking at the heatmap we can conclude that the Bayes factor is robust. The data indicate extreme evidence across a range of different values for the prior distribution on ψ.

In sum, the evidence in favor of the alternative hypothesis is overwhelming. To complete the picture, we quantified the difference between the two Rekentuin versions by estimating the size of the log odds ratio.

The Rekentuin manipulation directly targeted children’s motivation to play the games. Common A/B tests for web development purposes implement more subtle manipulations that result in much smaller effect sizes. In this section we analyze such a scenario.

Consider the following fictitious scenario: an online marketing team seeks to improve the click rate on a call-to-action button on their website’s landing page. Therefore, they devise an A/B test. Half of the website visitors read ‘Try our new product!’ (Version A), and the other half reads ‘Test our new product!’ (Version B).^{10}

This example was inspired by a real conversion rate optimization project at

To demonstrate the analyses we use synthetic data. The corresponding R code can be found at the

The columns ‘Yes’ and ‘No’ indicate whether the visitor clicked on the call-to-action button.

| Condition | Yes | No | Total |
|---|---|---|---|
| A | 1131 | 8869 | 10000 |
| B | 1275 | 8725 | 10000 |
| Total | 2406 | 17594 | 20000 |
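A Python sketch recovers the observed conversion rates and the posterior mean difference under uniform beta(1, 1) priors:

```python
# Cell counts from the webshop table.
clicks_A, n_A = 1131, 10_000
clicks_B, n_B = 1275, 10_000

# Observed conversion rates.
rate_A = clicks_A / n_A
rate_B = clicks_B / n_B

# Posterior means of beta(1 + clicks, 1 + non-clicks) distributions.
post_mean_A = (1 + clicks_A) / (2 + n_A)
post_mean_B = (1 + clicks_B) / (2 + n_B)
mean_diff = post_mean_B - post_mean_A
```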

We again use the IBE approach and assign uninformed beta(1,1) prior distributions to the success probabilities θ_{A} and θ_{B}.

The left-hand panel of

The right-hand panel of

The analytically calculated posterior probability of the event θ_{B} > θ_{A} equals 0.999.

A sequential analysis again tracks the posterior means of θ_{A} and θ_{B} as well as the 95% HDI of the difference. With increasing sample size, the HDI becomes narrower, indicating that the range of likely values for δ becomes smaller. After some initial fluctuation, the posterior mean difference between the two success probabilities θ_{A} and θ_{B} settles at ∼ 0.014.

Before the data can be analyzed according to the LTT approach, a prior distribution for the log odds ratio has to be specified. For this purpose, it is important to note that the subtle manipulations of common A/B tests generally result in very small effect sizes. The effect size of website changes (i.e., the difference in conversion rates between the baseline version and its modification) is typically as small as 0.5% or less (

For the present example, we will compare the impact of different prior distributions on the analysis outcome (i.e., a sensitivity analysis). Suppose that the online marketing team specifies three prior distributions. Firstly, there is the prior distribution specified by a team member who is still relatively unfamiliar with conversion rate optimization. This team member lacks substantive knowledge about plausible values of the log odds ratio and prefers the uninformed standard normal prior, truncated at zero to represent the expectation of a positive effect. Secondly, an optimistic team member centers the prior on μ_{ψ} = 0.18, and thirdly, a conservative team member centers the prior on μ_{ψ} = 0.05. In sum, the three prior distributions are the uninformed prior with μ_{ψ} = 0 and σ_{ψ} = 1, the optimistic prior with μ_{ψ} = 0.18 and σ_{ψ} = 0.005, and the conservative prior with μ_{ψ} = 0.05 and σ_{ψ} = 0.03.

The optimistic prior (middle panel) is a normal distribution N(0.18, 0.005^{2}). The conservative prior (right-hand panel) is a normal distribution N(0.05, 0.03^{2}).
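To see how differently the three priors treat a small, web-typical effect, one can evaluate their densities at, say, ψ = 0.1 (an illustrative value; for simplicity this sketch omits the truncation at zero of the uninformed prior, which would double its density on positive values):

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# (mu_psi, sigma_psi) for the three priors on the log odds ratio.
priors = {
    "uninformed":   (0.00, 1.000),
    "optimistic":   (0.18, 0.005),
    "conservative": (0.05, 0.030),
}

densities = {name: normal_pdf(0.1, mu, sd) for name, (mu, sd) in priors.items()}
```

The sharply peaked optimistic prior assigns virtually no mass to an effect of this size, which illustrates why overly precise priors can be heavily penalized when the data disagree with them.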

We used the ^{11}

It follows from transitivity that the optimistic colleague outpredicted the pessimistic colleague by a factor of 80/27 ≈ 2.96.

The influence of the prior distribution on the Bayes factor can be explored more systematically with the Bayes factor robustness plot. Varying the mean μ_{ψ} and the standard deviation σ_{ψ} of the prior distribution on ψ shows that the BF_{+0} mostly ranges from about 10 to about 60. The evidence is generally less compelling for prior distributions that are relatively wide (i.e., high σ_{ψ}) or relatively peaked but away from zero (i.e., low σ_{ψ} and high μ_{ψ}). In both scenarios, substantial predictive mass is wasted on effect sizes that are unreasonably large and were unlikely to manifest themselves in the context of the present webshop A/B experiment.

The heatmap shows the evidence for H_{1} over H_{0} across a range of reasonable values for μ_{ψ} and σ_{ψ}. The evidence is less compelling when the prior for the log odds ratio is relatively wide (i.e., when σ_{ψ} is relatively high) or far away from zero (i.e., when μ_{ψ} is relatively high).

The prior and posterior probabilities of the hypotheses are displayed on top of

In sum, the fictional webshop data present strong to very strong evidence for the claim that the conversion rate is higher in Version B than in Version A (

The A/B test concerns a comparison between two proportions and it is ubiquitous in medicine, psychology, biology, and online marketing. Here we outlined two Bayesian A/B tests: the ‘Independent Beta Estimation’ or IBE approach that assigns independent beta priors to the two proportion parameters, and the ‘Logit Transformation Testing’ or LTT approach that assigns a normal prior to the log odds ratio parameter. These approaches are based on different assumptions and hence ask different questions. We believe that the LTT approach deserves more attention: in many situations, the assumption of independence for the proportion parameters is not realistic. Moreover, only with the LTT approach is it possible for practitioners to obtain evidence in favor of or against the null hypothesis.^{12}

It is possible to expand the IBE approach and add a null hypothesis that both success probabilities are exactly equal (e.g., Jeffreys, 1961), yielding an Independent Beta Testing (IBT) approach. A discussion of the IBT is beyond the scope of this paper (cf.

The LTT approach could be extended to include the possibility of an interval-null or perinull hypothesis to replace the traditional point-null hypothesis (

Despite its theoretical advantages, the Bayesian LTT approach has been applied to empirical data only sporadically. This issue is arguably due to the fact that many researchers are not familiar with this procedure and the practical advantages that it entails. The fact that the LTT approach had, until recently, not been implemented in easy-to-use software is another plausible reason for its widespread neglect. In this manuscript we outlined the Bayesian LTT approach and showed how implementations in R and JASP make it easy to execute. In addition, we demonstrated with several examples how the LTT approach yields informative inferences that may usefully supplement or supplant those from a traditional analysis.

Below we summarize three key results concerning the IBE approach to the Bayesian A/B test, where the interest centers on the difference between two binomial chance parameters θ_{1} and θ_{2} that are assigned independent beta priors. Thus, we have θ_{i} ∼ Beta(α_{i}, β_{i}), i ∈ {1, 2}, and δ = θ_{1} − θ_{2}. As described in the main text, observing y_{1} successes and n_{1} − y_{1} failures results in a beta posterior distribution with parameters α_{1} + y_{1} and β_{1} + n_{1} − y_{1}. To keep notation simple we assume that the updates have already been integrated into the parameters of the beta distribution.

The first result below gives an expression for the posterior probability that θ_{1} > θ_{2}, the second gives the posterior distribution of the difference δ = θ_{1} − θ_{2}, and the third gives a normal approximation to this distribution.

The posterior probability that θ_{1} > θ_{2} can be expressed in closed form, where _{3}F_{2} is the generalized hypergeometric function.

The posterior distribution of δ = θ_{1} − θ_{2} is also available in closed form. For 0 < δ ≤ 1, the density can be written in terms of the beta functions B(α_{1}, β_{1}) and B(α_{2}, β_{2}), where F_{1} is Appell’s first hypergeometric function.

Moreover, if α_{1} + α_{2} > 1 and β_{1} + β_{2} > 1 we have:

The difference between two independent normal random variables X ∼ N(μ_{X}, σ^{2}_{X}) and Y ∼ N(μ_{Y}, σ^{2}_{Y}) again follows a normal distribution: X − Y ∼ N(μ_{X} − μ_{Y}, σ^{2}_{X} + σ^{2}_{Y}). When the parameters of the composite beta distributions are sufficiently large, the difference between a beta(α_{2}, β_{2}) and a beta(α_{1}, β_{1}) distribution is therefore approximated by a normal distribution with mean α_{2}/(α_{2} + β_{2}) − α_{1}/(α_{1} + β_{1}) and variance α_{2}β_{2}/[(α_{2} + β_{2})^{2}(α_{2} + β_{2} + 1)] + α_{1}β_{1}/[(α_{1} + β_{1})^{2}(α_{1} + β_{1} + 1)].
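A Python sketch of this normal approximation (applied, for illustration, to the bee-example posteriors beta(66, 36) for Version B and beta(51, 51) for Version A):

```python
import math

def beta_moments(a, b):
    """Mean and variance of a beta(a, b) distribution."""
    return a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))

def prob_diff_positive(a2, b2, a1, b1):
    """Normal approximation to P(theta_2 - theta_1 > 0) for two
    independent beta distributions."""
    m2, v2 = beta_moments(a2, b2)
    m1, v1 = beta_moments(a1, b1)
    z = (m2 - m1) / math.sqrt(v1 + v2)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

p = prob_diff_positive(66, 36, 51, 51)
```

The result closely matches the analytic value of 0.984 reported in the main text for the bee data.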

The data are freely available at

The supplementary materials comprise the data, all preregistration materials, and an online appendix; they can be accessed in the

This research was supported by the Netherlands Organisation for Scientific Research (NWO; grant #016.Vici.170.083).

E. J. W. declares that he coordinates the development of the open-source software package JASP (https://jasp-stats.org), a non-commercial, publicly-funded effort to make Bayesian and non-Bayesian statistics accessible to a broader group of researchers and students.

The authors are grateful to Oefenweb for allowing them to analyze the anonymized data and make it publicly available. The use and publication of the anonymized Rekentuin data has been coordinated with and permitted by Oefenweb.