Data Min Knowl Disc (2009) 18:140–181
DOI 10.1007/s10618-008-0114-1

Controlled experiments on the web: survey and practical guide

Ron Kohavi · Roger Longbotham · Dan Sommerfield · Randal M. Henne

Received: 14 February 2008 / Accepted: 30 June 2008 / Published online: 30 July 2008
© Springer Science+Business Media, LLC 2008
Abstract  The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments, A/B tests (and their generalizations), split tests, Control/Treatment tests, MultiVariable Tests (MVT) and parallel flights. Controlled experiments embody the best scientific design for establishing a causal relationship between changes and their influence on user-observable behavior. We provide a practical guide to conducting online experiments, where end-users can help guide the development of features. Our experience indicates that significant learning and return-on-investment (ROI) are seen when development teams listen to their customers, not to the Highest Paid Person's Opinion (HiPPO). We provide several examples of controlled experiments with surprising results. We review the important ingredients of running controlled experiments, and discuss their limitations (both technical and organizational). We focus on several areas that are critical to experimentation, including statistical power, sample size, and techniques for variance reduction. We describe common architectures for experimentation systems and analyze their advantages and disadvantages. We evaluate randomization and hashing techniques, which we show are not as simple in practice as is often assumed. Controlled
experiments typically generate large amounts of data, which can be analyzed using data mining techniques to gain deeper understanding of the factors influencing the outcome of interest, leading to new hypotheses and creating a virtuous cycle of improvements. Organizations that embrace controlled experiments with clear evaluation criteria can evolve their systems with automated optimizations and real-time analyses. Based on our extensive practical experience with multiple systems and organizations, we share key lessons that will help practitioners in running trustworthy controlled experiments.

Responsible editor: R. Bayardo.

R. Kohavi (B) · R. Longbotham · D. Sommerfield · R. M. Henne
Microsoft, One Microsoft Way, Redmond, WA 98052, USA
e-mail: ronnyk@microsoft.com
R. Longbotham e-mail: rogerlon@microsoft.com
D. Sommerfield e-mail: dans@microsoft.com
R. M. Henne e-mail: rhenne@microsoft.com
Keywords  Controlled experiments · A/B testing · e-commerce · Website optimization · MultiVariable Testing · MVT
1 Introduction
One accurate measurement is worth more than a thousand expert opinions
– Admiral Grace Hopper

In the 1700s, a British ship's captain observed the lack of scurvy among sailors serving on the naval ships of Mediterranean countries, where citrus fruit was part of their rations. He then gave half his crew limes (the Treatment group) while the other half (the Control group) continued with their regular diet. Despite much grumbling among the crew in the Treatment group, the experiment was a success, showing that consuming limes prevented scurvy. While the captain did not realize that scurvy is a consequence of vitamin C deficiency, and that limes are rich in vitamin C, the intervention worked. British sailors eventually were compelled to consume citrus fruit regularly, a practice that gave rise to the still-popular label limeys (Rossi et al. 2003; Marks 2000).

Some 300 years later, Greg Linden at Amazon created a prototype to show personalized recommendations based on items in the shopping cart (Linden 2006a,b). You add an item, recommendations show up; add another item, different recommendations show up. Linden notes that while the prototype looked promising, "a marketing senior vice-president was dead set against it," claiming it would distract people from checking out. Greg was "forbidden to work on this any further." Nonetheless, Greg ran a controlled experiment, and the "feature won by such a wide margin that not having it live was costing Amazon a noticeable chunk of change. With new urgency, shopping cart recommendations launched." Since then, multiple sites have copied cart recommendations.

The authors of this paper were involved in many experiments at Amazon, Microsoft, Dupont, and NASA. The culture of experimentation at Amazon, where data trumps intuition (Kohavi et al. 2004), and a system that made running experiments easy, allowed Amazon to innovate quickly and effectively. At Microsoft, there are multiple systems for running controlled experiments. We describe several architectures in this paper with their advantages and disadvantages. A unifying theme is that controlled experiments have great return-on-investment (ROI) and that building the appropriate infrastructure can accelerate innovation. Stefan Thomke's book title is well suited here: Experimentation Matters.
The web provides an unprecedented opportunity to evaluate ideas quickly using controlled experiments, also called randomized experiments (single-factor or factorial designs), A/B tests (and their generalizations), split tests, Control/Treatment, and parallel flights. In the simplest manifestation of such experiments, live users are randomly assigned to one of two variants: (i) the Control, which is commonly the "existing" version, and (ii) the Treatment, which is usually a new version being evaluated. Metrics of interest, ranging from runtime performance to implicit and explicit user behaviors and survey data, are collected. Statistical tests are then conducted on the collected data to evaluate whether there is a statistically significant difference between the two variants on metrics of interest, thus permitting us to retain or reject the (null) hypothesis that there is no difference between the versions. In many cases, drilling down to segments of users using manual (e.g., OLAP) or machine learning and data mining techniques allows us to understand which subpopulations show significant differences, thus helping improve our understanding and progress forward with an idea.

Controlled experiments provide a methodology to reliably evaluate ideas. Unlike other methodologies, such as post-hoc analysis or interrupted time series (quasi-experimentation) (Charles and Melvin 2004), this experimental design methodology tests for causal relationships (Keppel et al. 1992, pp. 5–6). Most organizations have many ideas, but the return-on-investment (ROI) for many may be unclear and the evaluation itself may be expensive. As shown in the next section, even minor changes can make a big difference, and often in unexpected ways. A live experiment goes a long way in providing guidance as to the value of the idea. Our contributions include the following.
– In Sect. 3 we review controlled experiments in a web environment and provide a rich set of references, including an important review of statistical power and sample size, which are often missing in primers. We then look at techniques for reducing variance that we found useful in practice. We also discuss extensions and limitations so that practitioners can avoid pitfalls.

– In Sect. 4, we present several alternatives to MultiVariable Tests (MVTs) in an online setting. In the software world, there are sometimes good reasons to prefer concurrent uni-variate tests over traditional MVTs.

– In Sect. 5, we present generalized architectures that unify multiple experimentation systems we have seen, and we discuss their pros and cons. We show that some randomization and hashing schemes fail conditional independence tests required for statistical validity.

– In Sect. 6 we provide important practical lessons.

When a company builds a system for experimentation, the cost of testing and experimental failure becomes small, thus encouraging innovation through experimentation. Failing fast and knowing that an idea is not as great as was previously thought helps provide necessary course adjustments so that other more successful ideas can be proposed and implemented.
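The significance-testing step described above, comparing a metric between Control and Treatment and rejecting the null hypothesis of no difference, can be sketched for a conversion-rate metric with a standard two-proportion z-test. This is an illustrative sketch only, not code from the systems discussed in this paper; the function name and the example counts are invented for illustration.

```python
from math import sqrt

def two_proportion_ztest(conv_c: int, n_c: int, conv_t: int, n_t: int) -> float:
    """z statistic for the difference between Control and Treatment
    conversion rates, under the null hypothesis of no difference."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    p_pool = (conv_c + conv_t) / (n_c + n_t)        # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    return (p_t - p_c) / se

# 10,000 users per variant; Control converts at 2.0%, Treatment at 2.5%
z = two_proportion_ztest(200, 10_000, 250, 10_000)
# |z| > 1.96 rejects the null hypothesis at the conventional 5% level
print(f"z = {z:.2f}, significant = {abs(z) > 1.96}")
```

In practice one would use a statistics library rather than hand-rolled formulas, and, as Sect. 3 discusses, choose the sample size in advance so the test has adequate statistical power.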
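The randomization step itself, assigning each live user deterministically to Control or Treatment, is commonly implemented by hashing a user identifier together with a per-experiment key; Sect. 5 shows that naive schemes can fail the conditional independence required for validity. The sketch below is a generic illustration under assumed names (the function, the key format, and the use of SHA-256 are this sketch's choices, not a description of the systems evaluated in the paper).

```python
import hashlib

def assign_variant(user_id: str, experiment: str, num_variants: int = 2) -> int:
    """Deterministically bucket a user into one of num_variants variants.

    Hashing (experiment, user_id) with a strong hash keeps each user's
    assignment stable across visits, while the per-experiment key keeps
    assignments in different experiments independent of one another,
    unlike a scheme such as `user_id % 2` reused everywhere.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_variants

# Stable: the same user always sees the same variant of one experiment.
assert assign_variant("user-42", "cart-recs") == assign_variant("user-42", "cart-recs")
```

A uniform hash yields a near 50/50 split over many users, and a given user's bucket in one experiment carries no information about their bucket in another, which is exactly the property the naive schemes in Sect. 5 lack.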