
6 mistakes that will derail your optimization program

(and how not to make them)


An in-depth look at how to properly structure your experiments.
Foreword

Recently, I was chatting with a friend of mine – a data analyst for a big-name direct-to-consumer brand.

We were talking about experimentation at her organization: what's working and not working in her opinion; what the culture is like; what her biggest roadblocks are.

At her company, the "culture of experimentation" isn't an issue. Leadership fully supports experimentation, and the optimization teams have a lot of freedom to test.

But she suggested that there is a general lack of understanding around design of experiments that may be holding their optimization efforts back.

Which is not at all uncommon.

At Widerfunnel, we speak to so many optimization teams that have gotten stuck because they don't take the time to understand experiment design. And their programs suffer.

Conversion rate lift plateaus or drops. "Winning" variations actually hurt conversions. "Losing" variations are actually winners. Results spike for a few weeks, but then return to baseline. Experiment results don't compound. Important customer learnings are missed.

But here's the good news: It doesn't have to be like this.

In this guide, we outline six of the worst mistakes optimization teams make when it comes to design of experiments. And we show you the right approach to take.

Based on more than a decade of experience building optimization programs for companies like HP, Asics, Square, JBL Harman, DELUXE, and more, this is one whitepaper you really should read.

– Natasha Wahid, Marketing Manager at Widerfunnel
Mistake #1: Sacrificing quality for velocity

There are serious benefits to launching as many experiments as possible, as quickly as possible. Arguably, the highest-probability path to a truly amazing idea is to test as quickly as possible.

But marketers often equate optimization success solely with experiment velocity, without challenging the quality of the tests they're running. If you are sacrificing rigor for speed, you most likely won't get the results you're hoping for in the long run.

When the focus is totally on speed, you often spend less time structuring your tests, and you will miss out on insights. With every test, you should directly address a hypothesis. You should track all of the most relevant goals to generate maximum insights, and QA all variations to ensure bugs won't skew your data.

"An emphasis on velocity can create mistakes that are easily avoided when you spend more time on preparation." – Dennis Pavlina, Optimization Strategist, Widerfunnel

Another problem with a "velocity above quality" mindset: If you decide to test many ideas, quickly, you are sacrificing your ability to really validate and leverage a single idea. One winning experiment may mean quick conversion rate lift, but it doesn't mean you've explored the full potential of that idea.

You can often apply the insights gained from one experiment when building out the strategy for another experiment. Plus, those insights provide additional evidence for testing a particular concept. Lining up a huge list of experiments at once without taking these past insights into account can result in your testing program being more scattershot than evidence-based.

Don't miss out on testing to learn

A misguided emphasis on velocity in pursuit of conversion rate lift also limits your team in terms of what they can test. Optimizers are so often under pressure to satisfy the speed side of the equation that they sacrifice testing to learn.

In an insight-driving experiment, the primary purpose of your test is to answer a question, and lifting conversion rates is a secondary goal.

With this type of test, you're setting out on a quest for your unicorn insight.

Unicorn insights are the ideas that aren't applicable to any other business. You can't borrow them from industry-leading websites, and they're not ideas a competitor can steal.

Your unicorn insight is unique to your business. It could be finding that magic word that helps users convert all over your site, or discovering that key value proposition that keeps customers coming back.

Every business has a unicorn insight, but you are not going to find it by testing in your regular wheelhouse.

We sometimes run a test for our clients where we take a webpage and isolate the removal of each section of that page, one at a time. Are we expecting this test to deliver a big lift? Nope, but we are expecting this test to teach us something.

We know that this is the fastest possible way to answer the question, "What do users care about most on this page?" After this type of experiment, we suddenly have a lot of answers to our questions.

That's right: no lift, but we have insights and clear next steps.

We can then rank the importance of every element on the page, and start to leverage the things that seem to be important to users on this webpage on other areas of a site.

Rather than guessing at what we think users are going to respond to best, we run an insight-driving test and let the users give us the insights, which can then be applied all over a site.

And when you don't allow for 'down-time' in your optimization efforts – the time during which you restock the cupboard with ideas, and put those ideas into your testing piggy-bank – your program will hit a wall.

The right approach:

"The key to sustainable A/B testing output is to find a balance between short-term (maximum testing speed) and long-term (testing for data-collection and insights)." – Michael St Laurent, Senior Optimization Strategist, Widerfunnel

The right approach to experimentation is to balance your short-term and long-term testing goals. Obviously, lift is essential to any optimization program. But you should balance short-term revenue gains against experiments that are designed to produce insights, win or lose.

This approach celebrates both winning and losing experiments for their ability to drive your team's learnings, so that they become stronger at designing experiments that have an impact on your business.

Ultimately, insights will fuel your experimentation program and will help your team come up with stronger hypotheses for future testing.
Mistake #2: Misunderstanding experiment design

Optimizers and marketers often talk about "A/B testing" without understanding the different methods for structuring experiments. A/B testing is one type of experiment design. You can also consider A/B/n experimentation, multivariate testing, and factorial design. Each has pros and cons, strengths and weaknesses.

As you test, you will find that different business goals and circumstances will require different test designs. How you structure your tests will have a major impact on results and learnings gained.

Here's an overview of the different types of experimentation to consider:

A/B test

In an A/B test, you are testing your original page or experience (A) against a single variation (B) to see which will result in a higher conversion rate.

Variation B might feature a multitude of changes (i.e. a 'cluster' of changes) or a single isolated change.

A/B/n test

In an A/B/n test, you are testing more than two variations of a page at once. "N" refers to the number of versions being tested, anywhere from two versions to the "nth" version.

In this wireframe, variation B features a cluster of changes: hero image, layout, and headline.

Multivariate test (MVT)

With multivariate testing, you are testing each individual change, isolated one against another, by mixing and matching every possible combination available.

Imagine you want to test a homepage redesign with four changes in a single variation:

● Change A: New hero banner
● Change B: New call-to-action (CTA) copy
● Change C: New CTA color
● Change D: New value proposition statement

Hypothetically, let's assume that each change has the following impact on your conversion rate:

● Change A = +10%
● Change B = +5%
● Change C = -25%
● Change D = +5%

If you were to run a classic A/B test―your current control page (A) versus a combination of all four changes at once (B)―you would get a hypothetical decrease of -5% overall (10% + 5% – 25% + 5%). You would assume that your redesign did not work, and most likely discard the ideas.

With a multivariate test, however, each possible combination of changes A, B, C, and D would be its own variation.
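To make the combinatorics concrete, here is a minimal sketch (using the hypothetical impact numbers above, and the simplifying assumption that effects just add up) that enumerates every combination a full multivariate test would have to cover:

```python
from itertools import combinations

# Hypothetical per-change impacts from the example above (not real data).
changes = {"A": 0.10, "B": 0.05, "C": -0.25, "D": 0.05}

variations = []
for size in range(1, len(changes) + 1):
    for combo in combinations(changes, size):
        # Simplifying assumption: effects are independent and simply add up.
        impact = sum(changes[c] for c in combo)
        variations.append((" + ".join(combo), impact))

for name, impact in variations:
    print(f"{name:15s} {impact:+.0%}")

print(f"\nVariations needed (beyond the control): {len(variations)}")  # 15
```

Running it lists all 15 combinations and flags A + B + D (+20%) as the best-performing mix – the "most ideal combination" referred to below.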

Multivariate testing is great because it shows you the positive or negative impact of every single change, and every single combination of every change, resulting in the most ideal combination (in this theoretical example: A + B + D).

However, this strategy is close to impossible in the real world. Even if you have a ton of traffic, it would still take more time than most marketers have for a test with 15 variations to reach any kind of statistical significance.

"The more variations you test, the more your traffic will be split while testing, and the longer it will take for your tests to reach statistical significance. Many companies simply can't follow the principles of MVT because they don't have enough traffic." – Alhan Keser, Sr. Manager, Digital Conversion Optimization, American Express

Factorial design allows for the speed of pure A/B testing combined with the insights of multivariate testing.

Factorial design: The middle ground

Factorial design is another method of Design of Experiments. Similar to MVT, factorial design allows you to test more than one element change within the same variation.

The greatest difference is that factorial design doesn't force you to test every possible combination of changes.

Rather than creating a variation for every combination of changed elements (as you would with MVT), you can design your experiment to focus on the specific isolations that you hypothesize will have the biggest impact.

With basic factorial experiment design, you could set up the following variations in our hypothetical example:

VarA: Change A = +10%
VarB: Change A + B = +15%
VarC: Change A + B + C = -10%
VarD: Change A + B + C + D = -5%

In this basic example, variation A features a single change; VarB is built on VarA, and VarC is built on VarB.

Note: With factorial design, estimating the value (e.g. conversion rate lift) of each change is a bit more complex than shown previously.

Firstly, let's imagine that our control page has a baseline conversion rate of 10%, and that each variation receives 1,000 unique visitors during your test.

When you estimate the value of change A, you are using your control as a baseline. Given the above information, you would estimate that change A is worth a 10% lift by comparing the 11% conversion rate of variation A against the 10% conversion rate of your control.

The estimated conversion rate lift of change A = (11 / 10 – 1) = 10%

But, when estimating the value of change B, variation A must become your new baseline.

The estimated conversion rate lift of change B = (11.5 / 11 – 1) = 4.5%

As you can see, the 'value' of change B is slightly different from the 5% difference shown above.
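Here is a minimal sketch of that back-calculation, using only the hypothetical rates above (10% for the control, 11% for variation A, 11.5% for variation B); the isolated_lifts helper is our own illustration, not a formula from any particular tool:

```python
# Observed conversion rates from the example above: each variation is
# compared against the variation it was built on, not against the control.
observed = {
    "Control":      0.100,
    "VarA (A)":     0.110,
    "VarB (A + B)": 0.115,
}

def isolated_lifts(rates):
    """Estimate each change's value by treating the previous (nested)
    variation as the baseline for the next one."""
    names = list(rates)
    return {
        curr: rates[curr] / rates[prev] - 1
        for prev, curr in zip(names, names[1:])
    }

for variation, lift in isolated_lifts(observed).items():
    print(f"{variation:14s} isolated lift ≈ {lift:+.1%}")
# VarA (A)       isolated lift ≈ +10.0%
# VarB (A + B)   isolated lift ≈ +4.5%
```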

When you structure your tests with factorial design, you can work backwards to isolate the effect of each individual change by comparing variations. But, in this scenario, you have four variations instead of 15.

"We are essentially nesting A/B tests into larger experiments so that we can still get results quickly without sacrificing insights gained by isolations." – Michael St Laurent

Factorial design allows you to get the best potential lift, with five total variations in two tests, rather than 15 variations in a single multivariate test.

But, wait…

It's not always that simple. How do you hypothesize which elements will have the biggest impact? How do you choose which changes to combine and which to isolate?

The Strategist's exploration

The answer lies in the Explore (or research gathering) phase of your testing process.

At Widerfunnel, Explore is an expansive thinking zone, where all options are considered. Ideas are informed by your business context, persuasion principles, digital analytics, user research, and your past test insights and archive.

Experience is the other side of this coin. A seasoned optimization strategist can look at the proposed changes and determine which changes to combine (i.e. cluster), and which changes should be isolated due to risk or potential insights to be gained.

At Widerfunnel, we don't just invest in the rigorous training of our Strategists. We also have a 10-year-deep test archive that our Strategy team continuously draws upon when determining which changes to cluster, and which to isolate.
Factorial design in action: A case study


Annie Selke is a multi-faceted company that sells luxury homeware goods, including home textiles, rugs, and luxury linens. We worked with the Annie Selke team over the course of a year to improve conversion rates across their site (which includes five different brands).

This story follows two experiments we ran on Annie Selke's product category page. In the first experiment, we tested three variations against the control.

The original product category page – our control.

The first variation – variation A – featured an isolated change to the 'Sort By' filters below the image, making them a drop-down menu.

Evidence?

This change was informed by qualitative click map data, which showed low interaction with the original filters. Strategists also theorized that, without context, visitors may not even know that these boxes are filters (based on e-commerce best practices). This variation was built on the control.

(Screenshot below.)

In variation A, we replaced the original 'Sort By' categories with a more traditional drop-down menu.

Variation B was also built on the control, and featured another isolated change: a reduced left navigation.

Evidence?

Click map data showed that most visitors were clicking on "Size" and "Palette", and past testing had revealed that Annie Selke visitors were sensitive to the removal of distractions. Plus, the persuasion principle known as the Paradox of Choice theorizes that more choice = more anxiety for visitors.

(Screenshot below.)

Variation B featured a reduced left-hand navigation.

Unlike variation B, variation C was built on variation A rather than on the control, and featured a final isolated change: a collapsed left navigation.

Evidence?

This variation was informed by the same evidence as variation B.

(See screenshot below.)

Variation C featured a collapsed left-hand filter (built on variation A).

Results

● Variation A (built on the control) saw a decrease in transactions of -23.2%.
● Variation B (built on the control) saw no change.
● Variation C (built on variation A) saw a decrease in transactions of -1.9%.

But wait! Because variation C was built on variation A, we knew that the estimated value of change C (the collapsed filter) was 19.1%.

The next step was to validate our estimated lift of 19.1% in a follow-up experiment.

The follow-up experiment

The follow-up test also featured three variations versus the original control. Because you should never waste the opportunity to gather more insights!

Variation A was our validation variation. It featured the collapsed filter (change C) from the first experiment's variation C, but maintained the original 'Sort By' functionality from the first experiment's control.

(See screenshot below.)

Variation A featured the collapsed filter & original 'Sort By' functionality.

Variation B was built on variation A, and featured two changes emphasizing visitor fascination with colors. We 1) changed the left nav filter from "palette" to "color", and 2) added color imagery within the left nav filter.

Evidence?

Click map data suggested that Annie Selke visitors are most interested in refining their results by color, and past test results also showed visitor sensitivity to color.

(See screenshot below.)

In variation B, we updated "palette" to "color", and added color imagery. (A variation featuring two clustered changes.)

Variation C was built on variation A, and featured a single isolated change: we made the collapsed left nav persistent as the visitor scrolled.

Evidence?

Scroll maps and click maps suggested that visitors want to scroll down the page and view many products.

(See screenshot below.)

In variation C, the collapsed filter was persistent.

Results

Variation A led to a 15.6% increase in transactions, which is very close to our estimated 19% lift, validating the value of the collapsed left navigation!

Variation B was the big winner, leading to a 23.6% increase in transactions. Based on this win, we could estimate the value of the emphasis on color.

Variation C resulted in a 9.8% increase in transactions, but because it was built on variation A (not on the control), we learned that the persistent left navigation was actually responsible for a decrease in transactions of -11.2%.

This is what factorial design looks like in action: big wins and big insights, informed by human intelligence.

The right approach:

The right approach to design of experiments is to ensure your team actually understands the nuances between the different testing types.

Both you and your team should be educated on the options available to you in experimentation, and should always be able to link your experiments and experiment structure to overarching business goals.

If you are in a situation where potential revenue gains outweigh the potential insights to be gained, or your test has little long-term value, you may want to go with a standard A/B cluster test.

If you have lots and lots of traffic, and value insights above everything, multivariate may be for you.

If you want the growth-driving power of pure A/B testing, as well as insightful takeaways about your customers, you should explore factorial design.

A note of encouragement: With factorial design, your tests will get better as you continue to test. With every test, you will learn more about how your customers behave, and what they want. That will make every subsequent hypothesis smarter, and every test more impactful.
Mistake #3: Obsessing over statistical significance

Statistical significance tells us whether or not there is a true difference between a variation and the control (i.e. that the difference is not due to chance). To assess this likelihood, we use a metric called Significance Level (or Confidence Level).

At Widerfunnel, we recommend testing to a 95% significance level whenever possible. If a winning variation is 95% statistically significant, you can be 95% confident that the results are not caused by randomness (a false positive).

But statistical significance is not the end-all, be-all. It is important, but marketers often talk about it as if it is the only determinant for completing an A/B test. In actuality, you cannot view it in a silo.

For example, during a recent experiment, the testing tool used reported statistical significance just three hours after the experiment went live. But, in three hours, we knew we had not gathered a representative sample size.

After 24 hours, the same experiment and testing tool reported a confidence level of just 88%. That's how quickly a confidence level can change if you are not taking sample size into account.
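For reference, here is a minimal sketch of how a confidence level for a variation-versus-control comparison can be computed — a simple two-sided z-test on two proportions, with hypothetical visitor counts. Testing tools may use different statistical engines (sequential or Bayesian, for example), so treat this as an illustration only:

```python
from math import erf, sqrt

def confidence_level(conv_c, n_c, conv_v, n_v):
    """Two-sided z-test on two proportions: returns the probability that
    the observed difference is not just random noise (e.g. 0.95 = 95%)."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = (p_v - p_c) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF
    return 1 - p_value

# Hypothetical counts: 500 / 5,000 converted on the control vs. 560 / 5,000
# on the variation. With a tiny sample, this number swings around wildly.
print(f"{confidence_level(500, 5_000, 560, 5_000):.1%}")
```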

"You should not wait for a test to be significant (because it may never happen) or stop a test as soon as it is significant. Instead, you need to wait for the calculated sample size to be reached before stopping a test. Use a test duration calculator to understand better when to stop a test." – Claire Vignon, Former Director of Strategy at Widerfunnel

Traffic behaves differently over time for all businesses, so you should always run a test for full business cycles, even if you have reached statistical significance. This way, your experiment takes into account all of the regular fluctuations in traffic that impact your business.

For an e-commerce business, a full business cycle is typically a one-week period; for subscription-based businesses, this might be one month or longer.

You don't have to run a test until it reaches statistical significance

As any optimizer knows, this may never happen. And it doesn't mean you should walk away from an A/B test completely. With testing experience, an expert understanding of your tool, and by observing the factors I'm about to outline, you can discover actionable insights that are directional (directionally true or false).
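Before digging into those factors, here is what the "calculated sample size" in Claire's quote looks like in practice: a minimal sketch of the standard two-proportion sample-size formula at roughly 95% significance and 80% power, with hypothetical baseline and traffic numbers. A test duration calculator essentially combines a calculation like this with your weekly traffic:

```python
from math import ceil

def sample_size_per_variation(baseline_rate, min_detectable_lift,
                              z_alpha=1.96, z_beta=0.84):
    """Visitors needed per variation to detect a relative lift over the
    baseline conversion rate at ~95% significance and ~80% power."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + min_detectable_lift)
    pooled = (p1 + p2) / 2
    numerator = (z_alpha * (2 * pooled * (1 - pooled)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return ceil(numerator / (p2 - p1) ** 2)

# Hypothetical inputs: 3% baseline conversion rate, 10% relative lift,
# 5,000 visitors per variation per week.
n = sample_size_per_variation(0.03, 0.10)
print(n, "visitors per variation,", ceil(n / 5_000), "weeks at this traffic level")
```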

You can use these factors to make the decision that makes the most sense for your business: implement the variation based on the observed trends, abandon the variation based on observed trends, and/or create a follow-up test!

● Results stability: Is the conversion rate difference stable over time, or does it fluctuate? Stability is a positive indicator.

● Experiment timeline: Did I run this experiment for at least a full business cycle? Did conversion rate stability last throughout that cycle?

● Relativity: If my testing tool uses a t-test to determine significance, am I looking at the hard numbers of actual conversions in addition to conversion rate? Does the calculated lift make sense?

● Lift & ROI: Is there still potential for the experiment to achieve X% lift? If so, you should let it run as long as it is viable, especially when considering the ROI.

● Impact on other elements: If elements outside the experiment are unstable (social shares, average order value, etc.), the observed conversion rate may also be unstable.

Check your graphs! Are conversion rates crossing? Are the lines smooth and flat, or are there spikes and valleys?

The right approach:

The right approach to completing an experiment with confidence is to ensure that you and your team are able to understand and consider all of the different factors that can act as predictors for whether your hypothesis is true or false.

While statistical significance should certainly be one criterion for determining whether your experiment is a winner or a loser, you should also consider other factors: results stability, timeline, relativity, revenue lift or return on investment, and finally the impact on other elements.
Mistake #4: Testing too many variations at once

We see this over and over again. A team has lots of ideas for variations on one page or for one element. And they want to try them all.

But it's a risky move. Having too many variations slows down your tests and, more importantly, it can impact the integrity of your data in a couple of ways.

First, the more variations you test against each other, the more traffic you will need, and the longer you'll have to run your test to get results that you can trust. This is simple math.

The issue with running a longer test is that you are more likely to be exposed to cookie deletion. If you run an A/B test for more than 3 to 4 weeks, the risk of sample pollution increases: In that time, people will have deleted their cookies and may enter a different variation than the one they were originally in.

"Within 2 weeks, you can get a 10% dropout of people deleting cookies and that can really affect your sample quality." – Ton Wesseling, Founder, Online Dialogue

The second risk when testing multiple variations is that the chance of a false positive goes up as the number of variations increases.

For example, if you use the accepted significance level of 0.05 and decide to test 20 different scenarios, one of those will be significant purely by chance (20 * 0.05). If you test 100 different scenarios, the number goes up to five (100 * 0.05).

In other words, the more variations, the higher the chance of a false positive, i.e. the higher your chances of finding a "winner" that is not a true winner.

Google's 41 shades of blue is a good example of this. In 2009, when Google could not decide which shade of blue would generate the most clicks on their search results page, they decided to test 41 shades. At a 95% confidence level, the chance of getting a false positive was 88%. If they had tested 10 shades, the chance of getting a false positive would have been 40%; 9% with 3 shades; and down to 5% with 2 shades.

This is called the Multiple Comparison Problem.

The following table summarizes the multiple comparison problem. Probability of a false positive with a 0.05 significance level:

# of hypotheses or variations    Probability of a false positive
 1                               5.0%
 2                               9.8%
 5                               22.6%
 8                               33.7%
10                               40.1%
41                               87.8%

Adjusted significance and confidence levels to maintain a 5% false discovery probability:

# of hypotheses or variations    Required significance level    Required confidence level
 1                               0.05                           95.0%
 2                               0.025                          97.5%
 5                               0.01                           99.0%
 8                               0.00625                        99.4%
10                               0.005                          99.5%
41                               0.001                          99.9%
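Both tables follow directly from the significance level: the probability of at least one false positive across n independent comparisons is 1 − (1 − 0.05)^n, and the adjusted levels in the second table match (approximately, for 41 variations) a Bonferroni-style correction of 0.05 / n. A minimal sketch that reproduces the figures:

```python
alpha = 0.05  # the accepted 5% significance level

for n in (1, 2, 5, 8, 10, 41):
    family_wise = 1 - (1 - alpha) ** n  # chance of at least one false positive
    adjusted = alpha / n                # Bonferroni-style per-comparison level
    print(f"{n:2d} variations: false positive risk {family_wise:5.1%}, "
          f"required significance level {adjusted:.5f} "
          f"(confidence {1 - adjusted:.1%})")
```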

The same problem also applies when you test multiple goals and segments.

"Each additional variation and goal adds a new combination of individual comparisons to an experiment. In a scenario where there are four variations and four goals, that's 16 potential outcomes that need to be controlled for separately." – Optimizely, Practical Guide to Statistics for Online Experiments

Some experimentation tools, such as VWO and Optimizely, adjust for the multiple comparison problem. These tools will make sure that the false positive rate of your experiment matches the false positive rate you think you are getting.

In other words, the false positive rate you set in your significance threshold will reflect the true chance of getting a false positive: you won't need to correct and adjust the confidence level.

One final problem with testing multiple variations can occur when you are analyzing the results of your experiment.

You may be tempted to declare the variation with the highest lift the winner, even though there is no statistically significant difference between the winner and the runner-up.

This means that, even though one variation may be performing better in the current test, the runner-up could "win" in the next round. You should consider both variations as winners.

The right approach:

The right approach to determining how many variations to test at one time is to make sure you have prioritized your ideas using a framework, like the PIE prioritization framework.

You can't test everywhere or everything at once. With limited time and resources and, most importantly, limited traffic to allocate to each variation, you must evaluate each idea: Is it based on evidence from analytics? Past experiment results? Psychological principles? Customer research?

Prioritizing where you invest energy will give you better returns by emphasizing the questions and variations that are more important to your business.

Encourage your team to come up with as many ideas as possible, but cap your variations per experiment based on what's appropriate for your business and site traffic.

This will most likely mean saving some of your ideas for future tests, and creating clustered and isolated changes using factorial design, based on the hypothesis you are trying to confirm or reject.
Mistake #5: Changing experiment settings mid-test

When you launch an experiment, you need to commit to it fully. Do not change the experiment settings, the test goals, or the design of the variation or of the control mid-experiment. And don't change traffic allocations to variations.

Changing the traffic split between variations during an experiment will impact the integrity of your results because of a problem known as Simpson's Paradox.

This statistical paradox appears when there is a trend in different groups of data which disappears when those groups are combined.

Ronny Kohavi from Microsoft shares an example wherein a website gets one million daily visitors on both Friday and Saturday. On Friday, 1% of the traffic is assigned to the treatment (i.e. the variation), and on Saturday that percentage is raised to 50%.

(See the table below.)

Table 1: Conversion rate for two days. Each day has 1M customers, and the treatment (T) is better than the control (C) on each day, yet worse overall.

                 Friday                       Saturday                     Total
                 (C/T split: 99% / 1%)        (C/T split: 50% / 50%)
Control (C)      20,000 / 990,000 = 2.02%     5,000 / 500,000 = 1.00%      25,000 / 1,490,000 = 1.68%
Treatment (T)    230 / 10,000 = 2.30%         6,000 / 500,000 = 1.20%      6,230 / 510,000 = 1.22%

Source: Seven Pitfalls to Avoid when Running Controlled Experiments on the Web
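A minimal sketch that reproduces the table's arithmetic and shows the reversal:

```python
# Conversions / visitors per day, taken from the table above.
data = {
    "Friday":   {"C": (20_000, 990_000), "T": (230, 10_000)},
    "Saturday": {"C": (5_000, 500_000),  "T": (6_000, 500_000)},
}

for day, groups in data.items():
    c = groups["C"][0] / groups["C"][1]
    t = groups["T"][0] / groups["T"][1]
    print(f"{day}: control {c:.2%} vs. treatment {t:.2%} -> treatment wins")

# Pooling the two unevenly split days reverses the conclusion.
for group in ("C", "T"):
    conv = sum(day_data[group][0] for day_data in data.values())
    visitors = sum(day_data[group][1] for day_data in data.values())
    print(f"Overall {group}: {conv:,} / {visitors:,} = {conv / visitors:.2%}")
```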

Changing the traffic allocation mid-test will also skew your results because it alters the sampling of your returning visitors. This change only affects new users: Once visitors are bucketed into a variation, they will continue to see that variation for as long as the experiment is running.

Let's say you start a test by allocating 80% of your traffic to the control and 20% to the variation. After a few days you change it to a 50/50 split. All new users are allocated accordingly from then on. But all the users that entered the experiment prior to the change will be bucketed into the same variation they entered previously.

In our current example, this means that the returning visitors will still be assigned to the control, and you will now have a large proportion of returning visitors (who are more likely to convert) in the control.

Note: This problem of changing traffic allocation mid-test only happens if you make a change at the variation level. You can change the traffic allocation at the experiment level mid-experiment. This is useful if you want to have a ramp-up period where you target only 50% of your traffic for the first few days of a test before increasing it to 100%. This won't impact the integrity of your results.
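To see how the sample gets skewed, here is a minimal simulation sketch with hypothetical numbers (1,000 new visitors per day, an 80/20 split switched to 50/50 on day 4); it tracks only who was bucketed before versus after the change, not conversions:

```python
splits = [(0.8, 0.2)] * 3 + [(0.5, 0.5)] * 7  # days 1-3: 80/20, days 4-10: 50/50
new_visitors_per_day = 1_000

buckets = {"control": {"early": 0, "late": 0},
           "variation": {"early": 0, "late": 0}}

for day, (control_share, variation_share) in enumerate(splits, start=1):
    phase = "early" if day <= 3 else "late"  # bucketed before or after the change
    buckets["control"][phase] += round(new_visitors_per_day * control_share)
    buckets["variation"][phase] += round(new_visitors_per_day * variation_share)

for name, counts in buckets.items():
    total = counts["early"] + counts["late"]
    share = counts["early"] / total
    print(f"{name}: {total:,} visitors, {share:.0%} bucketed before the split change")
```

Roughly 41% of the control's sample entered before the switch, versus about 15% of the variation's — and those early visitors are exactly the returning visitors who tend to convert better.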

The "do not change mid-test" rule extends to your test goals and the designs of your variations. If you're tracking multiple goals during an experiment, you may be tempted to change what the main goal should be mid-experiment.

Don't do it.

All optimizers have a favorite variation that we secretly hope will win during any given test. This is not a problem until you start giving weight to the metrics that favor this variation.

Decide on a goal metric that you can measure in the short term (the duration of a test) and that can predict your success in the long term. Track it and stick to it.

It is useful to track other key metrics to gain insights and/or debug an experiment if something looks wrong. However, these are not the metrics you should look at to make a decision, even though they may favor your favorite variation.

The right approach:

The right approach is to commit to your experiments fully. Take care to set up your experiments properly, and allow a test to run its course so you don't affect the integrity of your results.

If your data is skewed, the outcome won't mean anything. Any results – good or bad – achieved by the experiment are void because of a mistake in how the experiment was designed and/or run.

Understand that when your experiments are not structured properly, you are wasting valuable time and resources.

You and your team should set experiment success metrics beforehand as a way of ensuring that your own biases do not affect the results. Part of your job is to evaluate how you might be affecting an experiment.

This commitment will lead to stronger and more effective design of experiments, and will help you develop your skills and expertise as an optimizer.
Mistake #6: Evaluating an experiment on conversion rate lift alone

Some optimizers believe that an experiment is only as good as its effect on conversion rates.

Well, if conversion rate is the only success metric you are tracking, this may be true. But you're underestimating the true growth potential of A/B testing if that's how you structure your tests!

To clarify: Your main success metric should always be linked to your biggest revenue driver. (Unless you're conducting an insight-driving experiment, where learning is the only goal.)

But that doesn't mean you shouldn't track other relevant metrics. At Widerfunnel, we set up as many relevant secondary goals (clicks, visits, field completions, etc.) as possible for each experiment.

"This ensures that we aren't just gaining insights about the impact a variation has on conversion rate, but also the impact it's having on visitor behavior." – Dennis Pavlina

When you observe secondary goal metrics, your experiments become exponentially more valuable because each one generates a wide range of secondary insights.

These can be used to create follow-up experiments, identify pain points, and create a better understanding of how visitors move through your site.
An example


One of our clients provides an online consumer information service — visitors type in a question and get an Expert answer. This client has a 4-step funnel. With every test we run, we aim to increase transactions: the final, and most important, conversion.

But we also track secondary goals, like click-through rates and refunds/chargebacks, so that we can observe how a variation influences visitor behavior.

In one experiment, we made a change to step one of the funnel (the landing page). Our goal was to set clearer visitor expectations at the beginning of the purchasing experience.

We tested 3 variations against the original, and each resulted in increased transactions.

The secondary goals revealed important insights about visitor behavior, though. Firstly, each variation resulted in substantial drop-offs from step 1 to step 2: Fewer people were entering the funnel. But, from there, we saw gradual increases in clicks to steps 3 and 4.

Our variations seemed to be filtering out visitors without strong purchasing intent. We also saw an interesting pattern with one of our variations: It increased clicks from step 3 to step 4 by almost 12% (a huge increase), but decreased actual conversions by -1.6%.

This result was evidence that the call-to-action on step 4 was extremely weak (which led to a follow-up test).

Funnel analysis: % change in click-through rate vs. the control

         Step 1 to Step 2    Step 2 to Step 3    Step 3 to Step 4    Step 4 to Conversion
Var A    -5.98%              +1.02%              +2.21%              +9.32%
Var B    -4.39%              +0.40%              +0.72%              +12.92%
Var C    -2.05%              -0.88%              +11.46%             -1.60%

We also saw large decreases in refunds and chargebacks for this client, which further supported the idea that the visitors dropping off were the 'right' ones to lose (i.e. visitors without strong purchasing intent).

This is just a taste of what every experiment could be worth to your business. The right goal tracking can unlock piles of insights about your target visitors.
The right approach:

If you hyper-focus on conversion rates, you may miss out on valuable learnings. A better approach is to create an experimentation culture that celebrates failure.

Instead of criticizing tests for their inability to produce bottom-line results, design your experiments so that each one reveals something about your customers, so you can dive deep into the learnings.

One Widerfunnel client celebrates both winning and losing experiments for their ability to provide insights, using a neutrality metric to gauge experiment success.

While many companies use a metric like '% test wins' to evaluate their optimization program, the neutrality metric places a big emphasis on both winning and losing experiments. Within this organization, experiment results are categorized as follows:

● Wins: Experiments that are trending positively or show a statistically significant conversion rate lift.

● Neutral/Inconclusive: Experiments that show little or no noticeable or significant change against the original ("Control").

● Losses: Experiments that are trending negatively or show a statistically significant decrease.

An inconclusive experiment is treated as the worst-case scenario because there is no impact on the primary conversion rate that can be evaluated for learnings.

With the test neutrality ratio, instead of calculating the % of test wins, you are calculating the % of neutral or inconclusive tests that do not hit significance positively or negatively.

The goal of this metric is to reduce the number of tests that come out inconclusive or neutral, because mature experimentation organizations value 'losing' experiments as much as they value winning experiments, due to their potential for insights.

A neutrality metric can be improved by proper use of design of experiments, solid hypotheses, and a good understanding of your customers, so that you develop the types of hypotheses they will respond to.
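As a purely hypothetical illustration (these outcomes and labels are ours, not the client's actual reporting), a neutrality ratio could be tallied like this:

```python
from collections import Counter

# Hypothetical outcomes for one quarter of experiments.
outcomes = ["win", "loss", "neutral", "win", "neutral", "loss", "win", "neutral"]

counts = Counter(outcomes)
total = len(outcomes)

print(f"Win rate: {counts['win'] / total:.0%}")
print(f"Loss rate: {counts['loss'] / total:.0%}")
print(f"Neutrality ratio: {counts['neutral'] / total:.0%}  <- aim to drive this down")
```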
Learn from others' mistakes

Today, tools and technology allow you to track almost any marketing metric, meaning you have an endless sea of evidence that you can use to generate ideas on how to improve your digital experiences.

Which makes experimentation more important than ever.

An experiment shows you, objectively, whether or not one of your many ideas will actually increase conversion rates and revenue. And it shows you when an idea doesn't align with your users' expectations and will hurt your conversion rates.

But your experimentation program will never gain traction if you don't emphasize the need for proper design of experiments.

In this resource, we've outlined the 6 most common mistakes organizations make when designing experiments. Don't fall into these traps. It's up to you to make sure your team is putting in the work.
About Widerfunnel

Widerfunnel is a growth agency specializing in experimentation, customer experience, and personalization. We partner with the world's leading brands to uncover unique insights about their customers, and validate those insights through real-world experimentation. Our team of experts doesn't just consult and give advice. We test every recommendation in the real world to prove its value and gain proven insights.

"The Widerfunnel team shatters the misconception around 'losing' experiments by extracting valuable insights from each experiment and refining strategy in response to data. Where haphazard conversion optimization programs and marketers alike flock back to the drawing board after a 'loser', Widerfunnel capitalizes on the insights gained."

– Nate Wallingsford, Director of Conversions, Acquisition Marketing, The Motley Fool
Would you like to get your experimentation
program on the right track?
Find out how Widerfunnel can help you get more leads, more
transactions, and more revenue—fast!

+1 (604) 800-6450
www.widerfunnel.com
Contact Widerfunnel now
