Tristan Kime

A/B Testing Troubles? Here’s How to Tackle Them


A/B Testing Optimization

A/B testing is a powerful tool for optimizing websites, marketing campaigns, and product features. However, it comes with its fair share of challenges. Most of my recent roles have involved building and/or leading A/B testing teams focused on online conversion rate optimization. When looking at why some individual tests, or even portions of programs, have failed, some common themes have emerged. Let's explore some common difficulties in A/B testing and discuss strategies to overcome them.


Lack of Overarching Strategic Vision


One often overlooked challenge in A/B testing is the lack of a comprehensive strategic plan. Without a clear roadmap, businesses may find themselves conducting random tests that don't contribute to their overall goals or wasting resources on low-impact experiments. Ideally, the test plan is treated as a living document that evolves as results come in and business priorities shift.


To address this challenge:


  • Develop a Testing Roadmap

    • Create a prioritized list of testing ideas aligned with your business objectives.

    • Use frameworks like ICE (Impact, Confidence, Ease) or PIE (Potential, Importance, Ease) to score and rank your testing ideas (see the short scoring sketch after this list).

    • Be prepared to pivot or update your testing plan based on results of previous tests or changes in technical capabilities.

  • Create Individual Test Plans

    • Clearly articulate the hypothesis of the test, the segments to be tested, the outcomes to be measured, and any design and development work necessary to start.

    • Formalized test plans alleviate confusion among the testing, design, and development teams and provide a historical reference to understand the exact details of a test that was run in the past.

  • Set Clear Goals and KPIs

    • Define specific, measurable goals for each test that tie into broader business objectives.

    • Establish key performance indicators (KPIs) to track progress and measure success.

  • Allocate Resources Effectively

    • Determine the necessary resources (time, personnel, tools) for each test in advance.

    • Balance your testing efforts across different areas of your business or website.

  • Plan for Follow-up Actions

    • Outline potential next steps based on various test outcomes.

    • Prepare for quick implementation of successful variations.

  • Foster a Culture of Experimentation

    • Encourage team members to propose testing ideas and hypotheses.

    • Share test results and learnings across the organization to build support for the testing program.

  • Over-Communicate with Leadership

    • Provide periodic and meaningful updates to the leadership of your program.

    • Start by explaining the strategic plan; then provide updates on the progress of ongoing tests, the development of new tests, any obstacles in the way, and any suggested changes to the plan.
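
As a concrete illustration of the prioritization step above, here is a minimal sketch of ICE scoring in Python. The idea names and scores are hypothetical placeholders; in practice, teams score ideas collaboratively and revisit the rankings as new results come in.

```python
# Minimal ICE (Impact, Confidence, Ease) prioritization sketch.
# The ideas and scores below are illustrative placeholders, not real test data.

ideas = [
    {"name": "Simplify checkout form", "impact": 8, "confidence": 6, "ease": 7},
    {"name": "New hero headline",      "impact": 5, "confidence": 7, "ease": 9},
    {"name": "Reorder pricing tiers",  "impact": 7, "confidence": 4, "ease": 5},
]

# A common convention is to average the three 1-10 scores into a single ICE score.
for idea in ideas:
    idea["ice"] = (idea["impact"] + idea["confidence"] + idea["ease"]) / 3

# Rank the backlog so the highest-scoring ideas are tested first.
for idea in sorted(ideas, key=lambda i: i["ice"], reverse=True):
    print(f"{idea['name']}: ICE = {idea['ice']:.1f}")
```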


By implementing a strategic plan for your A/B testing efforts, you can ensure that your tests are purposeful, aligned with business goals, and contribute to long-term growth and optimization. This approach helps overcome the challenge of conducting isolated or irrelevant tests and maximizes the value of your testing program.


Sample Size and Statistical Significance


One of the most significant challenges in A/B testing is achieving a large enough sample size to draw statistically significant conclusions. Small sample sizes can lead to unreliable results and false positives.


Incremental tests on small volumes of traffic can take much longer to run than many stakeholders believe. As an example, assume you had an advertisement with 10,000 weekly visitors and 200 weekly conversions (a 2% CTR). If you want a confidence level of 95% and statistical power of 80% for that test, it will take about:

  • 1 week to show a 35% minimum detectable effect

  • 2 weeks to show a 25% minimum detectable effect

  • 5 weeks to show a 15% minimum detectable effect

  • 48 weeks to show a 5% minimum detectable effect


If that weekly traffic increases to 100,000 weekly visitors with the same 2% CTR, you can see a minimum detectable effect of 5% in 4 to 5 weeks.
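
If you want to sanity-check these durations yourself, a standard power calculation gets you close. The sketch below uses statsmodels to estimate the required sample size per variant for a 2% baseline rate at 95% confidence and 80% power; exact figures will differ somewhat from the numbers above depending on the calculator and its assumptions (one- vs. two-sided tests, effect-size approximations), so treat it as a rough estimate rather than a replacement for a dedicated tool.

```python
# Rough sample-size and duration estimate for a two-variant test on a 2% baseline rate.
# Results are approximate and will vary slightly from commercial calculators.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

baseline_rate = 0.02                 # 2% CTR
weekly_visitors_per_variant = 5_000  # 10,000 weekly visitors split 50/50

power_analysis = NormalIndPower()
for mde in (0.35, 0.25, 0.15, 0.05):                 # relative minimum detectable effects
    variant_rate = baseline_rate * (1 + mde)
    effect = proportion_effectsize(variant_rate, baseline_rate)
    n_per_variant = power_analysis.solve_power(effect_size=effect, alpha=0.05, power=0.8)
    weeks = n_per_variant / weekly_visitors_per_variant
    print(f"{mde:.0%} MDE: ~{n_per_variant:,.0f} visitors per variant (~{weeks:.1f} weeks)")
```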


Assuming an appropriate population size, I recommend running a test for two weeks. In some cases, the sample size would allow for a shorter testing timeline, but every business I have worked with has had differences in customer behavior based on the day of the week. What is effective Monday through Wednesday may not be effective for the remainder of the week. Two weeks allows you to smooth out some of the discrepancies you see in short-term testing.


To overcome this:


  • Use a sample size calculator to determine the required number of participants before starting your test. Speero has a tool that I find useful for both estimating test timing and determining significance while a test is running.

  • Be patient and allow your test to run for an adequate duration to collect sufficient data.

  • If your website has low traffic, consider focusing on testing elements with potentially larger impacts or running tests for longer periods.


Choosing the Right Metrics


Selecting appropriate metrics to measure can be tricky. It's essential to balance relevance, accuracy, and simplicity when defining what constitutes success for your test. Much of my work has been on optimizing the online purchase path for subscription services. Ideally, for most of that testing, sales conversion would be the primary KPI. Unfortunately, at times there is not enough traffic in the segments you would like to test to run a significant test in a reasonable time period. In these cases, choosing a metric higher in the funnel, like Click Through Rate (CTR), can provide insights into what encourages users to move to the next step in the process.


To address this challenge:


  • Align your metrics with your overall business goals.

  • Use clear and consistent definitions for your metrics.

  • Avoid vanity metrics that don't reflect the true impact of your changes.

  • Choose a metric higher in your conversion funnel that sees more traffic and allows you to run a more viable test (a quick significance check on such a metric is sketched below).
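
When a higher-funnel metric like CTR is your primary KPI, checking a result comes down to a simple two-proportion comparison. Here is a minimal sketch using statsmodels; the click and impression counts are made-up placeholders.

```python
# Minimal significance check on a higher-funnel metric such as CTR.
# The counts below are illustrative placeholders, not real test data.
from statsmodels.stats.proportion import proportions_ztest

clicks = [200, 241]            # clicks for control (A) and variant (B)
impressions = [10_000, 10_000] # visitors exposed to each variant

z_stat, p_value = proportions_ztest(count=clicks, nobs=impressions)
print(f"z = {z_stat:.2f}, p = {p_value:.4f}")
# A p-value below 0.05 would meet the 95% confidence threshold discussed earlier.
```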


Avoiding Bias and Maintaining Objectivity


It's easy to fall prey to confirmation bias or make decisions based on personal preferences rather than data. A test is typically suggested because it is expected to be a winner, so there can be bias and pressure to show that these tests win. This can lead to misinterpretation of results and poor decision-making. In one HBR article, the author notes,

"...80 to 90 percent of the A/B tests I’ve overseen would count as “failures” to executives. By failure they mean a result that’s “not statistically significant”—A wasn’t meaningfully different from B to justify a new business tactic, so the company stays with the status quo." [1]

While that failure rate does seem high, it reflects the reality that most of your tests may not lead to a winning outcome. In a previous role, we were able to increase sales conversion on our digital mobile platform by over 30%. That increase was the result of 15-20 tests that took the better part of a year to accomplish.
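
To put that kind of cumulative gain in perspective, individual wins compound multiplicatively across a program. The sketch below is purely illustrative (the per-test lifts are assumptions, not the actual results from that program), but it shows how a handful of modest winners spread across a year can add up to a 30%+ improvement.

```python
# Illustrative only: how modest winning tests compound into a large cumulative lift.
# The per-test lifts below are assumptions, not results from any real program.
winning_lifts = [0.04, 0.03, 0.05, 0.03, 0.06, 0.02, 0.04]  # relative lift per winning test

cumulative = 1.0
for lift in winning_lifts:
    cumulative *= (1 + lift)   # each win multiplies the running conversion rate

print(f"Cumulative lift after {len(winning_lifts)} winners: {cumulative - 1:.1%}")
# Prints roughly a 30% cumulative improvement for these assumed lifts.
```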


To maintain objectivity:


  • Follow a rigorous and objective methodology.

  • Use a hypothesis testing framework to structure your experiments.

  • Avoid making decisions based on emotions or intuition.

  • Run brainstorming sessions to provide and receive constructive criticism.


Dealing with Failed Tests


As noted above, not every A/B test will yield positive results. Dealing with failed tests can be discouraging. However, it's important to remember that even "failed" tests provide valuable insights. Failed tests are valuable because they are never rolled out to your entire population of users and can be reverted quickly. They also give you some understanding of what not to try in future tests. In one online sales flow I tested, there was a series of interstitial steps that informed customers of what was coming next and reassured them they were on the correct path to accomplish their task. Our group's common impression was that simplifying the purchase path by removing these additional pages would generate increased conversion rates. In test after test, this concept was disproven. This segment of users was happier with the existing path. Sometimes you must listen to the data and stay the course.


To handle this challenge:


  • View failed tests as learning opportunities rather than setbacks.

  • Analyze the results to understand why the variation didn't perform as expected.

  • Use the insights gained to inform future hypotheses and tests [3].


Flicker Effect


The flicker effect occurs when the original page appears briefly before the variation loads, potentially skewing results and negatively impacting user experience. This is a problem with many A/B testing tools that accept incoming traffic, determine whether that user should be in a test, determine which variant of the test is appropriate for the user, and then serve up the correct variation. Some tools mitigate the effect by holding the page load until this determination can be made. Holding the page load can reduce "flicker" but increases user frustration. A/B testing tools like Sitespect use a proxy architecture to overcome this issue but may require additional development effort to integrate.


To mitigate this issue:


  • Ensure proper script installation on your webpage.

  • Address any page loading speed issues.

  • Choose an A/B testing tool that minimizes the flicker effect.


Testing Too Many Elements Simultaneously


Testing multiple elements at once can make it difficult to determine which changes are responsible for the observed results. Very often, in order to prove the value of a team or testing program, there is pressure to show significant improvement very quickly. This can lead to testing experiences that are vastly different from the existing experience. At times, such a methodology can produce positive outcomes. Unfortunately, it is difficult to understand what in the new design or process was instrumental in the success or failure of the test.


To avoid this pitfall:


  • Focus on testing one or a few key elements at a time.

  • Prioritize elements that are likely to have the most significant impact on your goals.

  • Use multivariate testing when appropriate, but be aware of its increased complexity and sample size requirements (see the brief sketch after this list).

  • Hold to the test plan defined in the testing strategy to understand the impact of incremental changes.
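
To make the sample-size point about multivariate testing concrete, here is a back-of-the-envelope sketch: each additional element multiplies the number of combinations you must fill with traffic, and each combination needs roughly what a single A/B variant would. The per-combination figure below is an assumed placeholder, not a universal requirement.

```python
# Back-of-the-envelope look at how multivariate combinations inflate traffic needs.
# The per-combination sample size is an assumed placeholder for illustration.
variants_per_element = [2, 2, 3]   # e.g., headline x image x button color
n_per_combination = 7_000          # roughly what a single A/B variant might need

combinations = 1
for v in variants_per_element:
    combinations *= v

total_visitors = combinations * n_per_combination
print(f"{combinations} combinations -> ~{total_visitors:,} visitors needed in total")
```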


Balancing Short-term Gains and Long-term Impact


It can be challenging to find the right balance between changes that produce immediate results and those that contribute to long-term success.


To address this:


  • Consider both short-term metrics and long-term customer experience when designing tests.

  • Implement additional measures to evaluate customer experience post-test, such as surveys or feedback forms.

  • Perform A/B testing regularly to track changes in trends and customer behavior over time.


By understanding these common A/B testing challenges and implementing strategies to overcome them, you can improve the effectiveness of your testing program and make more informed decisions based on data-driven insights. Remember that A/B testing is an iterative process, and continuous learning and refinement are key to success.


For inquiries about how we can assist with your A/B testing program, please reach out to me at info@chimeradigitalstrategy.com.


Note:

This post was initially drafted by https://www.perplexity.ai/ and then edited and personalized by me. I am testing the viability of different AI tools in helping to prepare some of the content for this site.


While I found the AI-generated outline for this article helpful, it did not provide enough detail or specifics. Although I started this article with Perplexity, there is very little in it that was not edited. The outline gave me a framework, but the content certainly needed a lot of editing.


Citations:
