We need to use it or lose it
You've been tasked with improving conversion rate on the category pages, and your boss is eager to roll out the new ideas you have in the pipeline.
GoolyBib Office – Thursday Afternoon
You normally roll out new experiments on the category pages monthly, but it's the end of the last week of the month and you're not confident the latest test has reached a conclusion.
A/B Testing for a Baby Tech Team
As a product manager at GoolyBib, a promising young DTC startup selling wearable baby tech online, you are trying to prioritize your A/B test backlog. Your specific focus is on the category pages, and you have a lot of killer ideas you want to try. Unfortunately, with a low volume of traffic you can only run about one experiment per month.

This month, however, time is up and you still can't tell which variation is working best. Do you stop the test now and give up on something that might have improved performance? Or do you keep it running a little longer to give it a chance? The team worked hard on this test and thought it would work, so they don't want to give up too soon. At the same time, they're excited by the new ideas in the pipeline, so it's hard to know what to do.

You need to know the chances of being wrong by turning off the test: maybe statistics can help?
I'd like to see more progress on testing
We haven't had a big win yet but you have a lot of good ideas
If you think it's worth testing this one for another few weeks let's do it, but we need to be sure that's the right move
Let me know what you decide
When you're running an A/B testing program you quickly run into a limit on the number of experiments you can run. Your website might have a decent level of traffic, but once you split it down to different sections or traffic sources, you quickly find areas where you can only afford to test one or two things a month.

Prioritization is important, because only 10-20% of A/B tests succeed, so if you're running one test a month you might have to go 5+ months before celebrating a win. That means you need to be able to decide when to stop a test and move on: sometimes there isn't a clear enough difference within 30 days. If you stop the test early, you might be missing out on a big win. However, if you leave it running, you'll be using up days from the next cycle that you could be spending on the next idea.

Your team works hard coming up with new ideas, and doing all the work to create the designs, write the copy and build the pages is a big investment. They get excited by testing something new, so the temptation is to turn off the old experiment. However, they were equally excited by this idea when they had it, so you shouldn't be so quick to abandon your test.

You can compare the conversion rates of version A and version B after the test has been running for a couple of weeks, but that's not enough. You need to know if the test result is statistically significant: otherwise the result you're seeing could be random noise, and you might make the wrong decision. Statistical significance is a function of how many people were exposed to the different versions of the test, and the observed difference between them. So a smaller difference on a less popular section of the website will take longer to test and reach significance.
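That relationship between effect size, traffic and time-to-significance can be sketched with the standard two-proportion sample-size formula. This is a rough illustration, not part of the chapter's template: the 5% significance level, 80% power and the example conversion rates are all assumptions.

```python
from statistics import NormalDist

def required_sample_size(p_control, p_variant, alpha=0.05, power=0.8):
    """Approximate visitors needed PER VARIANT to detect the difference
    between two conversion rates with a two-sided test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for alpha=0.05
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return (z_alpha + z_beta) ** 2 * variance / (p_control - p_variant) ** 2

# A 5% -> 6% lift needs thousands of visitors per variant...
print(round(required_sample_size(0.05, 0.06)))
# ...while a 5% -> 10% lift needs only a few hundred:
print(round(required_sample_size(0.05, 0.10)))
```

Halving the detectable difference roughly quadruples the traffic required, which is why subtle changes on low-traffic pages take so long to call.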
...at Convert.com, we analysed 700 experiments and found again that 1 out of 7 experiments run (14%) made a positive impact on the conversion rate
This problem of too many experiments and not enough data never goes away: you may reach statistical significance quicker with more data, but as soon as you have more data you want to drill into more granular problems. Marketers talk a lot about big data, but actually it's knowing what to do with small data that's of more practical importance.
Sample sizes are never large. If $N$ is too small to get a sufficiently-precise estimate, you need to get more data (or make more assumptions). But once $N$ is "large enough," you can start subdividing the data to learn more (for example, in a public opinion poll, once you have a good estimate for the entire country, you can estimate among men and women, northerners and southerners, different age groups, etc.). $N$ is never enough because if it were "enough" you'd already be on to the next problem for which you need more data.
Is this A/B test significant?
You need to make a decision on whether to conclude the test and roll out new ideas. So you pull the data and calculate statistical significance.
GoolyBib Office – Friday Morning
The experiment is too close to call, so you want to calculate statistical significance and see if you can conclude the test yet. You search for a statistical significance calculator.
Statistical Significance Calculator
When you're running an A/B test, you can't just look at the difference in conversion rate after a couple of weeks and conclude that one version is better than the other. You need to do a statistical significance calculation, which tells you mathematically whether this observed result is likely to be true, or just the result of random noise. If you don't have much traffic and/or you aren't observing a very big difference between the two versions, it's highly likely that the test isn't yet statistically significant. Luckily we don't need to get into the statistics much ourselves: there are hundreds of free A/B test calculation tools available online. We just need to plug in the numbers and it'll tell us if the test result is significant.
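Under the hood, most of these calculators run a two-proportion z-test. Here's a minimal sketch of that calculation; the visitor and conversion counts are made up for illustration, not the chapter's actual experiment data.

```python
from math import sqrt
from statistics import NormalDist

def significance(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test.
    Returns (z statistic, p-value) for the observed difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both versions are equal
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control 120/2400 (5.0%), variant 141/2350 (6.0%)
z, p = significance(120, 2400, 141, 2350)
print(f"z = {z:.2f}, p = {p:.3f}")  # significant at 95% only if p < 0.05
```

Note that in this made-up example a full percentage point of observed lift still isn't significant at these traffic levels, which is exactly the situation described above.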
What version has a higher observed conversion rate so far?
Was the test result statistically significant?
Explain in your own words what statistical significance means. Do some Googling if you're unsure.
Stop, drop or roll?
Maybe there's a solution: Bayesian methods could tell us how confident we can be that we're not missing a potential winner.
GoolyBib Office – Friday Afternoon
Your test isn't significant... you ask a friend for help, and they come through. They send you a Bayesian testing template they got from a recent course.
Hey have you tried Bayesian testing?
It should help you quantify the risk that you're making the wrong decision by turning it off
Let me share the template I got from the Reforge course I did
Statistical significance testing, like we did in the last chapter, comes from the 'frequentist' branch of statistics. These methods tend to rely on mathematical calculations that require certain assumptions to work, and the results of these calculations can be difficult for non-statisticians to interpret. For example, many people interpret the p-value to mean 'the probability my test beat the control', but that's not strictly what it means!

Bayesian statistics instead relies on simple probability rules: this method identifies all possible scenarios using 'Monte Carlo' simulation, then adds up the scenarios where your hypothesis is true. This can lead to more interpretable results, because the outcome really is the probability that the test beat the control. Bayesian methods can be computationally intensive, but with the speed of modern computers that's no longer a major concern.

To estimate the confidence of your test result, you will use the template provided and the data exported from a Google Optimize experiment as reported by Google Analytics. Once you enter data into the template, bear in mind it takes a few seconds to update (progress bar, top right).
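The simulation approach the template uses can be sketched in a few lines of Python: draw many plausible conversion rates for each variation from its posterior Beta distribution, then count the share of draws where the solution wins. This is a simplified sketch of the general technique, not the template's exact implementation, and the counts are illustrative rather than the chapter's data.

```python
import random

random.seed(42)  # fixed seed so the simulation is reproducible

def prob_b_beats_a(conv_a, n_a, conv_b, n_b,
                   prior_wins=1, prior_losses=1, sims=100_000):
    """Estimate P(variation B's true rate > variation A's true rate).
    Each variation's posterior is Beta(prior_wins + conversions,
    prior_losses + non-conversions)."""
    wins = 0
    for _ in range(sims):
        rate_a = random.betavariate(prior_wins + conv_a,
                                    prior_losses + n_a - conv_a)
        rate_b = random.betavariate(prior_wins + conv_b,
                                    prior_losses + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / sims

# Hypothetical data: control 120/2400 visitors, solution 141/2350
print(prob_b_beats_a(120, 2400, 141, 2350))
```

The output is directly interpretable as "the probability the solution beats the control", which is exactly the claim you can't make with a frequentist p-value.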
We recommend choosing weak priors for Bayesian testing: take the conversion rate you would normally see and replicate it out of 100 samples. So for example, if you had a conversion rate of 50%, you would put 50 wins and 50 losses in the prior section. This gives your model the assumption that the performance of the test version is the same as the control, while also introducing uncertainty. You can see this in how the range is wider in the "Prior Beta Distribution - Beta(α,β)" chart.
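You can see why 100 samples makes a prior "weak" by comparing the spread of two Beta distributions built from the same 50% rate. This quick check is illustrative (the Beta(500, 500) comparison is our own example, not from the template):

```python
from math import sqrt

def beta_std(alpha, beta):
    """Standard deviation of a Beta(alpha, beta) distribution."""
    mean = alpha / (alpha + beta)
    return sqrt(mean * (1 - mean) / (alpha + beta + 1))

# Same 50% conversion rate, different amounts of prior "data":
print(f"Beta(50, 50) std:   {beta_std(50, 50):.3f}")    # wide = weak prior
print(f"Beta(500, 500) std: {beta_std(500, 500):.3f}")  # tight = strong prior
```

The prior out of 100 samples leaves plenty of room for the observed data to move the posterior; one built from 1,000 samples would dominate a month of real traffic.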
In the case of this test, the prior performance of the control variation was 1411 conversions from 2,376 visitors to the category page. How would you translate that into the \"Input Prior Data Here\" section?
Make a copy of the template and input your data. What is the recommendation from the conclusion section?
What are the chances that the solution is worse than the control?
Based on the 81% chance that the solution is better than the control, could we be reasonably confident that we can roll this test result out without adverse effects? What would be your decision – roll out the solution or drop it in favor of going back to the control?
Reforge Experimentation + Testing Deep Dive Course
The template from this chapter comes from the Reforge experimentation course. The founders Brian Balfour and Andrew Chen were very early in the growth marketing scene, and their instructors were headhunted from impressive startups, so the content of the courses is legitimately good. If you work in a growth role at a Series A startup or larger, we highly recommend you check out their courses.
This is the third Reforge course I have taken, and I continue to be extremely impressed with the quality of the course content, the instructors, the guest speakers, the alumni network, and the forums.
I am not uncertain
It's unreasonable to expect our boss to understand our calculations: how do we translate our findings into terms they can understand?
GoolyBib Office – Friday Early Evening
You've made your decision, now you need to decide how to explain the results. Your boss won't know what Bayesian statistics is, so you need to build your own intuition first.
One of the best things about a model template like our Bayesian testing calculator is that you can play around with it to build an understanding of how it works. For example, by changing the number of conversions for the control, you can see how many conversions you would have needed to get a statistically significant result. You can also look at what the model would conclude if you fast-forwarded a week with the same observed conversion rate, just with more data. You can also adjust the model's assumptions, for example making the priors stronger or weaker to see how they impact the final result. By simulating these different scenarios, you build intuition for how the model behaves, which will help you explain results more confidently to managers.
Take a look at the Cumulative Probability of Solution Beating Control by > X chart. This is the CDF (Cumulative Distribution Function), which tells you the probability of the solution beating the control by a certain percentage difference. What is the probability of the solution beating the control by at least 2%?
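The chart's curve can be reproduced with the same Monte Carlo idea: instead of counting every draw where the solution wins, count only the draws where it wins by more than a threshold. The counts below are illustrative, and we've treated the threshold as an absolute difference in conversion rate; check whether the template defines "> X" as absolute or relative lift before comparing numbers.

```python
import random

random.seed(7)  # fixed seed for reproducibility

def prob_beats_by(conv_a, n_a, conv_b, n_b, margin, sims=100_000):
    """P(solution's true rate exceeds control's by more than `margin`),
    using Beta(1 + conversions, 1 + non-conversions) posteriors."""
    count = 0
    for _ in range(sims):
        rate_a = random.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = random.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b - rate_a > margin:
            count += 1
    return count / sims

# Hypothetical data: sweep the margin to trace out the CDF chart's curve
for margin in (0.0, 0.01, 0.02):
    print(f"P(solution beats control by > {margin:.0%}):",
          prob_beats_by(120, 2400, 141, 2350, margin))
```

As the margin grows the probability falls, which is why the chart slopes downward: being fairly sure the solution wins at all is much easier than being sure it wins by 2%.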
Your choice of priors is an important concept in Bayesian testing. What you decide here impacts the outcome of the experiment, and allows you to incorporate your own knowledge of what's likely to succeed into the decision-making process, while still using robust statistics. For example, if you were more confident in the new version, you could increase the amount of data in the Solution's priors (out of 1,000 instead of 100) or even tell the model you expect it to perform better, by choosing a higher conversion rate. Try playing around with the template to see how it impacts the conclusion (wait for the spreadsheet to load after each edit, it takes a few seconds!).
Change the priors for the solution to be identical to the Control (1411 wins and 965 losses). How does that change the result of the model? Why is that?
If your experiment needs a statistician, you need a better experiment.
When you're running a test and you see a statistically insignificant result within the allotted time period, that can be a sign that the aspect you were testing just wasn't that important. In scenarios like this you don't want to make a mistake by rolling out the wrong variation, but you also don't want to spend too much time on a dead end. Good testing is about maintaining a good velocity, and because you're always learning more about your customers, you're right to bias towards new tests over old ones: those hypotheses were formed on more up-to-date information, so on balance they should have a better chance of success!
Explaining Bayesian Decisions
One nice thing about Bayesian stats is that once you strip back the statistical terms, the findings are often more intuitive to interpret than frequentist stats. For example, most people interpret statistical significance as 'the probability that the variation is better than the control', which isn't actually true in frequentist statistics, but it is true in Bayesian. This means that from this Bayesian testing template you can say something like "this test is showing a 3% improvement, and we have an 80% chance it's beating the control" and be confident in making that assertion, without wondering what ancient statistics gods you're angering. The other good thing is that no minimum sample size is required with Bayesian statistics, so it's perfect for interpreting smaller experiments where you don't have much data, getting you actionable results faster. For more background on how to interpret the results of the template, read the following article by 'Positive' John Ostrowski.
When should you stop a Bayesian A/B test?
What are the advantages of Bayesian A/B Testing over Frequentist methods?
Ok this is really clear
This is a pretty cool tool for us to use
It should make these decisions a lot faster for us going forward
Good work