
Analyzing Split-Tests Using R

You don't have to be a data scientist or a statistician to analyze a split test using R.  The most challenging part is selecting an appropriate sample size, which requires understanding a few split-testing parameters.  If you stop a split test immediately after observing the desired result, you can introduce bias into your experiment.  Stopping only after collecting an appropriate sample size helps to ensure that you are making unbiased decisions within the error bounds you are comfortable with.

In the following examples, let's assume that we are trying to optimize the conversion rate of a sales funnel.  We will refer to our existing sales funnel as A and the changes we are split testing as B.  When calculating sample sizes, it's helpful to think of the conversion rates of A and B as random variables with a range of possible outcomes.  Thinking of conversion rates in this way allows us to include two error parameters in our sample size calculation:

  1. False positives (α): incorrectly concluding that the conversion rate of B is better than A.
  2. False negatives (β): incorrectly concluding that there is no difference between the conversion rate of A and B.

You can pick the values of α and β you are comfortable with, but it's common to set α to 5% and β to 20%.  Note that α is more commonly known as the significance level, but its value is the same as the false positive rate.  Similarly, 1-β is known as "statistical power," which you will see later in this post; you can think of it as the "true positive" rate.
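For intuition, the critical z-values implied by these choices can be computed with base R's qnorm function; they reappear in the sample size formula later in this post.

# one-sided critical z-value for alpha = 0.05
qnorm(1 - 0.05)  # 1.644854

# z-value corresponding to 80% power (beta = 0.2)
qnorm(1 - 0.2)   # 0.8416212
Critical z-values implied by α and β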

Another important parameter is the "minimum detectable effect," or MDE for short.  As the name implies, this is the smallest relative difference between the conversion rates of A and B that you will be able to observe and still conclude that B is better than A.  The important thing to understand is that the smaller the MDE, the larger the required sample size, because detecting a small change requires more data than detecting a large one.  Therefore, a common practice is to set the MDE to 5% or more.
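To see this relationship in numbers, here is a minimal sketch using the "pwr" package (introduced below) that recomputes the required sample size for a few MDE values, assuming the 20% baseline conversion rate used later in this post.

library(pwr)

# the required sample size grows quickly as the MDE shrinks
for (mde in c(0.10, 0.05, 0.02)) {
  h <- ES.h(0.2 * (1 + mde), 0.2)
  n <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.8, alternative = "greater")$n
  cat(sprintf("MDE %.0f%%: n = %.0f per variant\n", mde * 100, ceiling(n)))
}
Sample size sketch for a few MDE values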

The final parameter you need to calculate the required sample size is the existing conversion rate of A, which you can obtain from your business metrics.  There can be a lot of variability in conversion rates, so it's best to pick a value on the lower end.  A lower value will result in a larger-than-required sample size, which is better than having too small of a sample size.
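As a quick illustration of why a conservative baseline matters, the sketch below recomputes the sample size for a few baseline conversion rates, holding the 5% relative MDE fixed; a lower baseline requires a larger sample.

library(pwr)

# a lower baseline conversion rate requires a larger sample size
for (cr in c(0.25, 0.20, 0.15)) {
  h <- ES.h(cr * 1.05, cr)
  n <- pwr.2p.test(h = h, sig.level = 0.05, power = 0.8, alternative = "greater")$n
  cat(sprintf("baseline %.0f%%: n = %.0f per variant\n", cr * 100, ceiling(n)))
}
Sample size sketch for a few baseline conversion rates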

Once you have the required parameters, you can input them into a sample size calculator.  Evan Miller has an excellent sample size calculator on his website, along with additional split-testing resources.  You can also calculate the sample size using the "pwr" package in R, as shown below.  Note that Evan Miller's calculator performs the calculation for two-sided tests, while the example below performs the calculation for a one-sided or "greater" test.

library(pwr)

# input parameters
alpha <- 0.05  # false positive rate (significance level)
beta <- 0.2    # false negative rate (power = 1 - beta)
mde <- 0.05    # minimum detectable effect, relative to cr
cr <- 0.2      # baseline conversion rate of A

# calculate the effect size
h <- ES.h(cr*(1+mde), cr)

# calculate the per-variant sample size for a one-sided two-proportion test
n <- pwr.2p.test(h=h, sig.level=alpha, power=1-beta, alternative="greater")

# output the result
print(n)
Sample size calculation script using "pwr" library
     Difference of proportion power calculation for binomial distribution (arcsine transformation) 

              h = 0.02477242
              n = 20149.36
      sig.level = 0.05
          power = 0.8
    alternative = greater

NOTE: same sample sizes
Script output

As you can see, the required sample size (n) is 20,150.  This means 20,150 for A and 20,150 for B, for a total of 40,300.
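If you want to sanity-check this number by hand, the one-sided normal-approximation formula behind the calculation is n = 2 × (z_α + z_β)² / h² per variant, which reproduces the pwr result.

library(pwr)

# per-variant n = 2 * (z_alpha + z_beta)^2 / h^2 for a one-sided test
z_alpha <- qnorm(1 - 0.05)
z_beta  <- qnorm(1 - 0.2)
h <- ES.h(0.2 * 1.05, 0.2)
2 * (z_alpha + z_beta)^2 / h^2  # ~20149, matching pwr.2p.test
Cross-checking the sample size by hand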


After running your experiment, analyzing a split test in R is as easy as inputting the results into the "prop.test" function, which is included in R's built-in "stats" package.

# sample size for a and b
a_n <- 20150
b_n <- 20150

# conversions for a and b
a_convs <- 4030
b_convs <- 4232

# test of equal proportions
t <- prop.test(c(b_convs, a_convs), c(b_n, a_n))

# output the results
print(t)
Analysis script using "prop.test"
	2-sample test for equality of proportions with continuity correction

data:  c(b_convs, a_convs) out of c(b_n, a_n)
X-squared = 6.151, df = 1, p-value = 0.01313
alternative hypothesis: two.sided
95 percent confidence interval:
 0.002092716 0.017956912
sample estimates:
   prop 1    prop 2 
0.2100248 0.2000000
Script output

The values of interest from the output are the sample estimates, the p-value, and the 95% confidence interval.  The sample estimates are simply the conversion rates of B and A, respectively.  Without getting into too much detail about what p-values are, the outcome is generally considered statistically significant if the p-value is lower than α, or 0.05 in this case.  Statistically significant simply means that there is enough data to support the conclusion that there is a difference between the conversion rates of A and B.  Don't forget that there's still a small chance, equal to α, that the result could be a false positive.  The confidence interval is the 95% interval on the observed difference between B and A.  For instance, we observed approximately a 1% difference between B and A, but because the conversion rates are modeled as random variables, the true difference could be as low as 0.2% or as high as 1.8%.
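If you want to work with these values programmatically, prop.test returns a standard "htest" object whose fields can be accessed directly.  And because the sample size above was calculated for a one-sided test, you can also pass alternative = "greater" to prop.test to run the matching one-sided version.

# pull the values of interest from the result above
t$estimate  # conversion rates for B and A
t$p.value   # 0.01313
t$conf.int  # 0.0021 to 0.0180

# the sample size was calculated for a one-sided test, so you can
# also run the matching one-sided version
prop.test(c(b_convs, a_convs), c(b_n, a_n), alternative = "greater")
Accessing the test results programmatically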


Below, I have included an example of what the output would look like if there were no statistically significant difference between A and B.

# sample size for a and b
a_n <- 20150
b_n <- 20150

# conversions for a and b
a_convs <- 4030
b_convs <- 4100

# test of equal proportions
t <- prop.test(c(b_convs, a_convs), c(b_n, a_n))

# output the results
print(t)
Analysis script using "prop.test"
	2-sample test for equality of proportions with continuity correction

data:  c(b_convs, a_convs) out of c(b_n, a_n)
X-squared = 0.7336, df = 1, p-value = 0.3917
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.004411553  0.011359444
sample estimates:
   prop 1    prop 2 
0.2034739 0.2000000
Script output

As you can see, we cannot conclude that there's a difference between the conversion rates of A and B: the observed difference is only 0.35%, the p-value is greater than α, and the confidence interval spans -0.4% to +1.1%, which includes zero.  Note that there is a chance equal to β, or 20%, that this result is a false negative.
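As a final check, you can ask pwr.2p.test to solve for power instead of sample size by leaving the power argument empty; this confirms that 20,150 observations per variant deliver the planned 80% power at the 5% relative MDE.

library(pwr)

# power delivered by 20,150 observations per variant at the planned 5% MDE
h <- ES.h(0.2 * 1.05, 0.2)
pwr.2p.test(h = h, n = 20150, sig.level = 0.05, alternative = "greater")$power  # ~0.80
Verifying the power of the collected sample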