
Drawing Inferences from A/B Tests on Proportions: Frequentist vs. Bayesian Approach

Introduction

Drawing inferences from A/B tests is an integral part of many data scientists' work. Often we hear about the frequentist (classical) approach, where we specify the alpha and beta rates and check whether we can reject the null hypothesis in favor of the alternative hypothesis. Bayesian inference, on the other hand, uses Bayes' Theorem to update the probability that a hypothesis is true as more evidence becomes available.

In this blog post, we are going to use R to follow the example in [1] and extend it with a sensitivity analysis to observe the impact of tweaking the priors on the findings. [1] has a great discussion on the advantages and disadvantages of Frequentist vs. Bayesian that I’d recommend reading. My main takeaways are that:

  1. Bayesian inference is often criticised for its subjective prior, which we will examine in the sensitivity analysis section
  2. Frequentist inference is criticised because the same data can yield different p-values under different experiment set-ups, which we will examine in the next section under stopping rules
  3. “… for any decision rule, there is a Bayesian decision rule which is, in a precise sense, at least as good as a rule” – it doesn’t hurt for a data scientist to gain another perspective in making inferences

Case Study Background

The objective of the experiment is to check whether a coin is biased. Suppose the person who conducts the experiment (let's call him the researcher) is not the same person who analyses the results (let's call him the analyst).

The researcher has two ways to stop the experiment (stopping rules):

  1. Toss the coin 6 times and report the number of heads
  2. Toss the coin until the first tail appears

The researcher reports HHHHHT and his stopping rule to the analyst. However, the analyst forgets which stopping rule was used.

Frequentist Approach

The Frequentist analyst sets up the hypothesis:
\(H_0: \theta = 0.5, H_A : \theta > 0.5\)

Binomial Distribution

Under stopping rule (1), the number of heads follows a Binomial distribution. More formally, Observed Heads ~ Bin(6, 0.5) under the null hypothesis.

# Binomial Distribution
n = 6
num_heads = c(1:n)
pmf_binom <- dbinom(num_heads, size = n, prob = 0.5)
plot(num_heads, pmf_binom, type = "h", main = "Prob mass function of a Binomial distribution")

# The following two lines are equivalent
1-pbinom(q=4,size=6,prob=0.5)
pmf_binom[5]+pmf_binom[6]

Prob(5 or 6 heads in 6 tosses) = 0.1094

Therefore, we fail to reject the null hypothesis at the 0.05 significance level.

Geometric Distribution

Under stopping rule (2), the number of heads observed before the first tail follows a Geometric distribution.

Num_heads_before_1st_tail ~ Geometric(0.5)

# Geometric Distribution (here a "failure" is a head; a tail ends the tossing)

num_fails = c(0:10)
pmf_geom = dgeom(x = num_fails, prob = 0.5)
sum(pmf_geom)  # sanity check: close to 1 over this range
plot(num_fails, pmf_geom, type = "h", main = "Prob mass function of a Geometric dist.")

# The following two lines are equivalent
1- pgeom(q=4,prob=0.5)
1-sum(pmf_geom[1:5])

P(at least 5 heads before the 1st tail) = 0.0313

Therefore, we reject the null hypothesis at the 0.05 significance level. Notice how the same data lead to opposite conclusions.

Under the frequentist approach, the stopping rule, which determines the distribution of the random variable, must therefore be specified before the experiment.
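The two computations above can be condensed into a side-by-side comparison (a minimal sketch of the calculations already shown):

```r
# Same data (5 heads, then a tail), two stopping rules, two p-values
p_binom <- 1 - pbinom(4, size = 6, prob = 0.5)  # P(>= 5 heads in 6 tosses)
p_geom  <- 1 - pgeom(4, prob = 0.5)             # P(>= 5 heads before the 1st tail)
round(c(binomial = p_binom, geometric = p_geom), 4)  # ~0.109 vs ~0.031
```

Only the geometric p-value falls below 0.05, even though the observed flips are identical.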

Bayesian Approach

We want to estimate theta, defined as the true probability that the coin comes up heads. We use a beta distribution as the conjugate prior; to keep the focus on the case study, the beta distribution is introduced in the appendix.

For the prior, let's say theta follows a Beta(3,3) distribution, which is fairly flat and centred at 0.5. This says the analyst believes the coin is fair, with the (3,3) parameters expressing his uncertainty. We will study the impact of changing these two parameters in the Sensitivity Analysis section. For now, let's go with:

Theta_prior ~ Beta(3,3)

During the experiment, we observe 6 flips, of which 5 are heads. Let's fill in the following table:

Item        | Prior | Experiment | Posterior
Heads       | 3     | 5          | 8
Tails       | 3     | 1          | 4
Total flips | 6     | 6          | 12
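Because the beta prior is conjugate to the binomial likelihood, filling in this table is just adding counts; a minimal helper (the function name is my own):

```r
# Conjugate update: a Beta(a, b) prior plus observed counts gives a Beta posterior
update_beta <- function(a, b, heads, tails) {
  c(alpha = a + heads, beta = b + tails)
}
update_beta(3, 3, heads = 5, tails = 1)  # alpha = 8, beta = 4, i.e. Beta(8, 4)
```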


# Bayesian Approach
theta=seq(from=0,to=1,by=.01)
plot(theta,dbeta(theta,8,4)
,type="l"
, ylim = c(0,6)
, col = "red"
, lwd =2
, ylab = "Prob. Density Function"
, main = "Prob. Density Function")

lines(theta,dbeta(theta,3,3),type="l", col = "green", lwd =2)
lines(theta,dbeta(theta,5,1),type="l", col = "blue", lwd =2)

abline(v=0.5, col='grey')

legend("topright",
legend = c("Posterior", "Prior", "Experiment"),
col = c("red", "green", "blue"),
bty = "n",
text.col = "black",
horiz = F ,
inset = c(0.1, 0.1),
lty = 1, lwd=2)

1-pbeta(0.5, 8,4)
\(P(\theta > 0.5 | data) = 0.89 \)

i.e., 0.89 is the area under the red curve, to the right of 0.5. In the next section, we investigate the impact of changing the shape of the prior distribution on posterior probabilities.
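As a sanity check, the same tail probability can be reached without the conjugacy shortcut, by normalising prior times likelihood on a grid (a rough numerical sketch):

```r
# Grid approximation: posterior is proportional to prior * likelihood
theta  <- seq(0, 1, by = 0.001)
unnorm <- dbeta(theta, 3, 3) * theta^5 * (1 - theta)  # Beta(3,3) prior; 5 heads, 1 tail
post   <- unnorm / sum(unnorm)                        # normalise on the grid
sum(post[theta > 0.5])                                # close to 1 - pbeta(0.5, 8, 4)
```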

Sensitivity Analysis: Impact of the Prior Distribution on Posterior Probabilities

How would changing the prior distribution from Beta(3,3) have an impact on the posterior probability that theta > 0.5? In this section, we are going to change the variance and the expected value of the distribution as part of the sensitivity analysis.

\( mean = \frac{\alpha}{\alpha + \beta};\) \( variance = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\)
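These moments are straightforward to compute (helper names are mine):

```r
# Mean and variance of a Beta(a, b) distribution
beta_mean <- function(a, b) a / (a + b)
beta_var  <- function(a, b) (a * b) / ((a + b)^2 * (a + b + 1))
beta_mean(3, 3)  # 0.5
beta_var(3, 3)   # ~0.036
```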

 

(1) Changing the variance – When we inject a stronger prior (lower variance) that the coin is fair, the posterior probability is reduced from 0.89 to 0.85.

Distribution | Mean | Variance | \(P(\theta > 0.5 | data)\)
Beta(1,1)    | 0.5  | 0.083    | 0.94
Beta(2,2)    | 0.5  | 0.050    | 0.91
Beta(3,3)    | 0.5  | 0.036    | 0.89
Beta(5,5)    | 0.5  | 0.023    | 0.85
## Sensitivity Analysis - change the variance

par(mfrow = c(2,2))


alpha_prior = 5
beta_prior = 5

alpha_expt = 5
beta_expt = 1

alpha_post = alpha_prior + alpha_expt
beta_post = beta_prior + beta_expt

title = paste0("Prior Beta(", alpha_prior, "," , beta_prior, ")")

# Bayesian Approach

theta=seq(from=0,to=1,by=.01)
plot(theta,dbeta(theta,alpha_post,beta_post)
     ,type="l"
     , ylim = c(0,6)
     , col = "red"
     , lwd =2
     , ylab = "Prob. Density Function"
     , main = title)

lines(theta,dbeta(theta,alpha_prior,beta_prior),type="l", col = "green", lwd =2)
lines(theta,dbeta(theta,alpha_expt,beta_expt),type="l", col = "blue", lwd =2)

abline(v=0.5, col='grey')


# Prior Mean
alpha_prior / (alpha_prior + beta_prior)

# Prior Variance
(alpha_prior * beta_prior) / ((alpha_prior + beta_prior)^2 * (alpha_prior + beta_prior+1))

# P(theta > 0.5 | data)
1-pbeta(0.5, alpha_post,beta_post)

Above: Effect of changing prior variance whilst keeping mean constant. Green: Prior; Red: Posterior; Blue: Experiment
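The per-prior snippet above can also be condensed into a loop that reproduces the probabilities in the table, assuming the same experiment counts (5 heads, 1 tail):

```r
# P(theta > 0.5 | data) for priors of increasing strength, all with mean 0.5
priors <- list(c(1, 1), c(2, 2), c(3, 3), c(5, 5))
for (p in priors) {
  post <- p + c(5, 1)  # conjugate update with the experiment counts
  cat(sprintf("Beta(%d,%d) prior -> P = %.2f\n", p[1], p[2],
              1 - pbeta(0.5, post[1], post[2])))
}
```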

(2) Changing the mean – Similarly, and as expected, when we inject a prior that the coin is biased towards tails while the experiment points towards heads, we become less confident that the coin is biased towards heads.

Given a mean and variance, I needed to compute alpha and beta. Thankfully, this Stack Overflow post helps us do that:

For simplicity, we round alpha and beta off to the nearest integer, so the variance may differ slightly from the target.
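The conversion can be expressed as a small function (the name is mine; the formulas are the standard method-of-moments identities from that post):

```r
# Method of moments: recover (alpha, beta) from a target mean m and variance v
beta_params <- function(m, v) {
  a <- ((1 - m) / v - 1 / m) * m^2
  c(alpha = a, beta = a * (1 / m - 1))
}
beta_params(0.7, 0.036)  # roughly alpha = 3.4, beta = 1.5, rounded to Beta(3, 1)
```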

par(mfrow = c(2,2))

mean = 0.7
variance = 0.036

alpha_prior = ((1-mean)/variance - 1/mean) * mean^2
beta_prior = alpha_prior * (1/mean - 1)

alpha_prior = round(alpha_prior,0)
beta_prior = round(beta_prior,0)


alpha_expt = 5
beta_expt = 1

alpha_post = alpha_prior + alpha_expt
beta_post = beta_prior + beta_expt

title = paste0("Prior Beta(", alpha_prior, "," , beta_prior, ")")

# Bayesian Approach

theta=seq(from=0,to=1,by=.01)
plot(theta,dbeta(theta,alpha_post,beta_post)
     ,type="l"
     , ylim = c(0,6)
     , col = "red"
     , lwd =2
     , ylab = "Prob. Density Function"
     , main = title)

lines(theta,dbeta(theta,alpha_prior,beta_prior),type="l", col = "green", lwd =2)
lines(theta,dbeta(theta,alpha_expt,beta_expt),type="l", col = "blue", lwd =2)

abline(v=0.5, col='grey')


# Prior Mean
alpha_prior / (alpha_prior + beta_prior)

# Prior Variance
(alpha_prior * beta_prior) / ((alpha_prior + beta_prior)^2 * (alpha_prior + beta_prior+1))

# P(theta > 0.5 | data)
1-pbeta(0.5, alpha_post,beta_post)


 

Distribution | Target mean | Variance | \(P(\theta > 0.5 | data)\)
Beta(2,3)    | 0.4         | 0.040    | 0.83
Beta(3,3)    | 0.5         | 0.036    | 0.89
Beta(3,2)    | 0.6         | 0.040    | 0.94
Beta(3,1)    | 0.7         | 0.038    | 0.98

Above: Effect of changing prior mean, keeping variance constant. Green: Prior; Red: Posterior; Blue: Experiment

Conclusion

In conclusion, we have demonstrated the Bayesian perspective on A/B testing with small samples. We saw that the stopping rule is critical in establishing the p-value under the frequentist approach, whereas it plays no role in the Bayesian approach. The Bayesian approach also gives a probability that a hypothesis is true, given the prior and the experiment results. Lastly, we observed how the posterior probability is affected by the mean and variance of the prior distribution.

Appendix – Beta distribution

The beta distribution is a family of continuous probability distributions defined on the interval [0,1] parametrized by two positive shape parameters, denoted by α and β. There are three reasons why the beta distribution is great for Bayesian inferences:

  1. The interval [0,1] makes it suitable to represent probabilities.
  2. It has the nice property that the posterior distribution is also a beta distribution. To be clear, the prior distribution refers to the distribution we believe theta exhibits before we do any analysis whilst the posterior distribution refers to the distribution we believe theta exhibits after we observe some samples.
  3. We can specify a large range of beliefs by changing α and β – the probability density function of theta, given α and β, is:

\( f(\theta; \alpha, \beta) = \frac{\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B(\alpha, \beta)} \)

From the above equation, we see that α and β control the shape of the distribution, and indeed, they are known as shape parameters. Let's plug some values into R and observe the difference in shapes. The expected value is computed by α / (α+β).

Color         | α   | β   | Mean = α / (α+β)
Black         | 0.5 | 0.5 | 0.50
Red (uniform) | 1   | 1   | 0.50
Blue          | 3   | 3   | 0.50
Yellow        | 5   | 5   | 0.50

theta=seq(from=0,to=1,by=.01)

plot(theta,dbeta(theta,0.5,0.5)
,type="l"
, ylim = c(0,3)
, col = "black"
, lwd =2
, ylab = "Prob. Density Function")

lines(theta,dbeta(theta,1,1),type="l", col = "red", lwd =2)
lines(theta,dbeta(theta,3,3),type="l", col = "blue", lwd =2)
lines(theta,dbeta(theta,5,5),type="l", col = "yellow", lwd =2)

Notice how the mean of all four distributions is the same at 0.5, yet very different shapes can be specified. This is what we meant by a large range of beliefs being expressible with the beta distribution.

References

[1] Jeremy Orloff, and Jonathan Bloom. 18.05 Introduction to Probability and Statistics. Spring 2014. Massachusetts Institute of Technology: MIT OpenCourseWare, https://ocw.mit.edu. License: Creative Commons BY-NC-SA.
