

Online Evaluation for Effective Web Service Development:
Extended Abstract of the Tutorial at TheWebConf’2018
Roman Budylin (Yandex; Moscow, Russia), budylin@yandex-team.ru
Alexey Drutsa (Yandex; Moscow, Russia), adrutsa@yandex.ru
Gleb Gusev (Yandex; Moscow, Russia), gleb57@yandex-team.ru
Eugene Kharitonov (Criteo Research; Paris, France), eugene.kharitonov@gmail.com
Pavel Serdyukov (Yandex; Moscow, Russia), pavser@yandex-team.ru
Igor Yashkov (Yandex; Moscow, Russia), excel@yandex-team.ru

ABSTRACT

Development of the majority of the leading web services and software products today is generally guided by data-driven decisions based on evaluation that ensures a steady stream of updates, both in terms of quality and quantity. Large Internet companies use online evaluation on a day-to-day basis and at a large scale. The number of smaller companies using A/B testing in their development cycle is also growing. Web development across the board strongly depends on the quality of experimentation platforms. In this tutorial, we will overview state-of-the-art methods underlying everyday evaluation pipelines at some of the leading Internet companies.

We invite software engineers, designers, analysts, service or product managers — beginners, advanced specialists, and researchers — to join us at The Web Conference 2018, which will take place in Lyon from 23 to 27 April, to learn how to make web service development data-driven and to do it effectively.

KEYWORDS

online evaluation; A/B testing; online controlled experiment; interleaving; industrial practice; online metrics

INTRODUCTION

Nowadays, the development of most leading web services and software products, in general, is guided by data-driven decisions that are based on online evaluation, which qualifies and quantifies the steady stream of web service updates. Online evaluation is widely used by modern Internet companies (such as search engines [15, 22, 30], social networks [5, 64], media providers [5], and online retailers) in a permanent manner and on a large scale. Yandex runs more than 100 online evaluation experiments per day; Bing reported more than 200 A/B tests run per day [42]; and Google conducted more than 1000 experiments [30]. The number of smaller companies that use A/B testing in the development cycle of their products grows as well. The development of such services strongly depends on the quality of the experimentation platforms. In this tutorial, we overview the state-of-the-art methods underlying the everyday evaluation pipelines.

At the beginning of this tutorial, we will give an introduction to online evaluation and basic knowledge from mathematical statistics (50 min, Section 1). Then, we will share rich industrial experience in constructing an experimentation pipeline and evaluation metrics, emphasizing best practices and common pitfalls (55 min, Section 2). This will be followed by approaches to the development of online metrics (55 min, Section 3) and by the foundations of interleaving (55 min, Section 4). A large part of our tutorial will be devoted to modern and state-of-the-art techniques (including ones based on machine learning) that make it possible to conduct online experimentation efficiently (80 min, Section 5). Finally, we will point out open research questions and current challenges that should be interesting for research scientists.

1 STATISTICAL FOUNDATION

We introduce the main probabilistic terms, which form a theoretical foundation of A/B testing. We introduce the observed values as random variables sampled from an unknown distribution. Evaluation metrics are statistics based on observations (mean, median, quantiles, etc.). An overview of statistical hypothesis testing is provided with definitions of the p-value, type I error, and type II error. We discuss several statistical tests (Student's t-test, Mann-Whitney U, and bootstrap [26]) and compare their properties and applicability [25].
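To make these tests concrete, here is a minimal sketch that applies Student's t-test, the Mann-Whitney U test, and a percentile bootstrap for the difference in means to two samples of a per-user metric; the synthetic data, variable names, and number of bootstrap iterations are illustrative choices, not prescriptions from the tutorial.

```python
# A minimal sketch of the tests discussed in Section 1 (illustrative only).
# `control` and `treatment` hold a per-user metric (e.g., clicks-per-user).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.poisson(lam=10.0, size=10_000)      # synthetic data for illustration
treatment = rng.poisson(lam=10.1, size=10_000)

# Student's t-test on the difference in means.
t_stat, t_pvalue = stats.ttest_ind(treatment, control)

# Mann-Whitney U test (rank-based, no normality assumption).
u_stat, u_pvalue = stats.mannwhitneyu(treatment, control, alternative="two-sided")

# Percentile bootstrap for the difference in means [26].
n_iter = 10_000
diffs = np.empty(n_iter)
for i in range(n_iter):
    c = rng.choice(control, size=control.size, replace=True)
    t = rng.choice(treatment, size=treatment.size, replace=True)
    diffs[i] = t.mean() - c.mean()
ci_low, ci_high = np.percentile(diffs, [2.5, 97.5])

print(f"t-test p={t_pvalue:.4f}, Mann-Whitney p={u_pvalue:.4f}, "
      f"bootstrap 95% CI for the mean difference: [{ci_low:.3f}, {ci_high:.3f}]")
```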
INTRODUCTION vided with definitions of p-value, type I error, and type II error. We
Nowadays, the development of most leading web services and soft- discuss several statistical tests (Student’s t-test, Mann Whitney U,
ware products, in general, is guided by data-driven decisions that and Bootstrap [26]), compare their properties and applicability [25].
are based on online evaluation which qualifies and quantifies the
steady stream of web service updates. Online evaluation is widely
used in modern Internet companies (like search engines [15, 22, 30],
2 EXPERIMENTATION PIPELINE AND
social networks [5, 64], media providers [5], and online retailers) WORKFLOW IN THE LIGHT OF
in permanent manner and on a large scale. Yandex run more than INDUSTRIAL PRACTICE
100 online evaluation experiments per day; Bing reported on more We share rich industrial experiences on constructing of an experi-
than 200 run A/B tests per day [42]; and Google conducted more mentation pipeline in large Internet companies. First, we introduce
than 1000 experiments [30]. The number of smaller companies that the notion of an A/B test (also known as an online controlled exper-
use A/B testing in the development cycle of their products grows iment): it compares two variants of a service at a time, usually its
as well. The development of such services strongly depends on current version (control) and a new one (treatment), by exposing
the quality of the experimentation platforms. In this tutorial, we them to two groups of users [44, 45, 49]. The aim of controlled
experiments is to detect the causal effect of service updates on its
Permission to make digital or hard copies of part or all of this work for personal or performance relying on an Overall Evaluation Criterion (OEC) [45],
classroom use is granted without fee provided that copies are not made or distributed a user behavior metric (e.g., clicks-per-user, sessions-per-user, etc.)
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for third-party components of this work must be honored.
that is assumed to correlate with the quality of the service. Sec-
For all other uses, contact the owner/author(s). ond, we discuss how can experiments be used for evaluation of
TheWebConf 2018, April 23–27, 2018, Lyon, France changes in various components of web services (e.g., the user inter-
© 2018 Copyright held by the owner/author(s).
ACM ISBN 0. . . $0.00
face [21, 23, 40, 48], ranking algorithms [21, 23, 48, 61], sponsored
https://doi.org/0 search [10], and mobile apps [64]).
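To illustrate how users end up in the control and treatment groups, below is a minimal sketch of deterministic, hash-based traffic splitting, a common building block of experimentation pipelines; the salt, bucket count, and function names are hypothetical and not taken from any particular platform described in the tutorial.

```python
# A minimal sketch of hash-based assignment of users to experiment variants.
# The salt, bucket count, and names are illustrative, not a specific platform's API.
import hashlib

NUM_BUCKETS = 1000

def bucket(user_id: str, experiment_salt: str) -> int:
    """Deterministically map a user to one of NUM_BUCKETS buckets."""
    digest = hashlib.sha256(f"{experiment_salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def variant(user_id: str, experiment_salt: str = "exp42") -> str:
    """Split traffic 50/50: lower half of buckets -> control, upper half -> treatment."""
    return "control" if bucket(user_id, experiment_salt) < NUM_BUCKETS // 2 else "treatment"

# The same user always gets the same variant within an experiment,
# while different salts give independent splits for concurrent experiments.
print(variant("user-123"), variant("user-123"), variant("user-456"))
```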

Third, we consider several real cases of experiments, where pitfalls [12, 17, 41, 43] are demonstrated and lessons are learned. In particular, we discuss conflicting experiments, network effects [28], duration and seasonality [21, 60], logging, and slices. Fourth, we present our management methodology to conduct experiments efficiently and to avoid the pitfalls. This methodology is based on pre-launch checklists and a team of Experts on Experiments (EE). Finally, we discuss how a large-scale experimental infrastructure [42, 62, 65] can be used to collect experiments for metric evaluation [19].

3 DEVELOPMENT OF ONLINE METRICS

We discuss in depth how to build evaluation metrics, which are the main ingredient of an online experimentation pipeline. First, the main components of a metric are presented: key metric, evaluation statistic, statistical significance test, Overall Evaluation Criterion (OEC) [45], and Overall Acceptance Criterion (OAC) [25]. We show that development of a good new metric is a challenging goal, since an appropriate OAC should possess two crucial qualities: directionality (the sign of the detected treatment effect should align with the positive/negative impact of the treatment on user experience) and sensitivity (the ability to detect a statistically significant difference when the treatment effect exists) [17, 41, 48, 51]. The former property makes it possible to draw correct conclusions about changes in system quality [17, 41, 48], while improvement of the latter allows one to detect metric changes in more experiments and to use fewer users [37, 38, 51].

Second, we provide evaluation criteria beyond averages: (a) how to evaluate periodicity [20, 23, 24] and trends [20, 21, 23] of user behavior over days, e.g., for detection of delayed treatment effects; and (b) how to evaluate frequent/rare behavior and diversity in behavior between users that cannot be detected by mean values [25]. Third, product-based aspects of metric building are presented. Namely, we discuss vulnerability of metrics such as a click on a button to switch a search engine [2] (i.e., how a metric can be gamed or manipulated); ways to measure different aspects of a service (i.e., speed [43, 46], absence [8], abandonment [43]); the difference between metrics of user loyalty and metrics of user activity [20–22, 25, 47, 57]; dwell time to improve click-based metrics [39]; how to evaluate the number of user tasks (which can have a complex hierarchy [6]) by means of sessions [61]; and issues in session division [34] as well.

Fourth, math-based approaches to improve metrics are considered. In particular, we discuss the powerful method of linearization [7] that reduces any ratio metric to the average of a user-level metric, preserving directionality and allowing usage of a wide range of sensitivity improvement techniques developed for user-level metrics; a sketch of this transformation is given below. We also describe different methods of noise reduction (such as capping [43], slicing [14, 61], taking into account user activity, etc.) and the utilization of a user-generated-content approach.
Finally, some system requirements for metric building are discussed. We explain how to get a set of experiments with verdicts (known positiveness or negativeness), how to construct a pipeline to easily implement and test metrics, and how to measure metrics [19].
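One simple way to think about measuring a metric against a corpus of labeled experiments (in the spirit of [19]) is sketched below: given per-experiment p-values and effect signs together with known verdicts, it estimates sensitivity as the fraction of experiments where the metric detects a significant change and directionality as the fraction of detected changes whose sign agrees with the verdict. The data structure and thresholds are hypothetical simplifications, not the exact procedure of [19].

```python
# Hypothetical sketch of scoring a metric on experiments with known verdicts,
# in the spirit of "Measuring Metrics" [19]; not the exact procedure of that paper.
from dataclasses import dataclass

@dataclass
class ExperimentResult:
    p_value: float      # significance of the metric's movement in this experiment
    effect_sign: int    # +1 if the metric moved "up", -1 if "down"
    verdict: int        # known ground truth: +1 (positive change) or -1 (negative)

def score_metric(results: list[ExperimentResult], alpha: float = 0.05):
    detected = [r for r in results if r.p_value < alpha]
    sensitivity = len(detected) / len(results) if results else 0.0
    agreeing = [r for r in detected if r.effect_sign == r.verdict]
    directionality = len(agreeing) / len(detected) if detected else float("nan")
    return sensitivity, directionality

# Toy usage with made-up experiments.
results = [
    ExperimentResult(0.01, +1, +1),
    ExperimentResult(0.20, -1, +1),   # not detected
    ExperimentResult(0.03, -1, -1),
    ExperimentResult(0.04, +1, -1),   # detected, but wrong direction
]
print(score_metric(results))
```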
4 INTERLEAVING FOR ONLINE RANKING EVALUATION

Online behavior might vary dramatically across users and queries, which can lead to significant noise in online measurements. Interleaving is an online evaluation technique that aims to eliminate these inter-user and inter-query variabilities in ranker comparisons. This is achieved by combining results from the compared rankers in a single result page (interleaving them). After that, the user's click behavior is analyzed to infer the preference direction.

The first interleaving method, Balanced Interleaving, was proposed by Joachims et al. [31, 32]. Later, multiple other methods and extensions were proposed, including Team Draft [55], Probabilistic Interleave [29], Optimized Interleaving [53], Multileaving [59], and Generalized Team Draft [36].

In this part of the tutorial, we start by discussing the motivation behind the interleaving methods. We introduce and illustrate the core interleaving algorithms: Balanced Interleave [31], Team Draft [55], and Probabilistic Interleave [29] (a sketch of Team Draft is given below). Further, we overview some of the available experimental studies [4, 9] that compare interleaving methods with offline and online evaluation approaches. Finally, we introduce the extension of the interleaving methods towards comparing multiple rankers, Multileaving [59].
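Below is a minimal sketch of Team Draft interleaving [55] as commonly described: in each round a coin flip decides which ranker contributes first, each ranker adds its highest-ranked not-yet-used document to the combined list, and clicks are later credited to the team that contributed the clicked document. The function names and the click-credit helper are illustrative, and edge cases of the published algorithm may be handled differently.

```python
# Illustrative sketch of Team Draft interleaving [55]; the published algorithm's
# details (e.g., tie handling) may differ slightly.
import random

def team_draft_interleave(ranking_a, ranking_b, length, rng=random):
    interleaved, used = [], set()
    team_a, team_b = set(), set()
    while len(interleaved) < length:
        # In each round, a coin flip decides which ranker contributes first.
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if rng.random() < 0.5:
            order.reverse()
        added = False
        for ranking, team in order:
            if len(interleaved) >= length:
                break
            doc = next((d for d in ranking if d not in used), None)
            if doc is None:
                continue
            interleaved.append(doc)
            used.add(doc)
            team.add(doc)
            added = True
        if not added:          # both rankings exhausted
            break
    return interleaved, team_a, team_b

def infer_preference(clicked_docs, team_a, team_b):
    """Credit each click to the ranker whose team contributed the clicked document."""
    a = sum(1 for d in clicked_docs if d in team_a)
    b = sum(1 for d in clicked_docs if d in team_b)
    return "A" if a > b else "B" if b > a else "tie"

page, team_a, team_b = team_draft_interleave(
    ranking_a=["d1", "d2", "d3", "d4"], ranking_b=["d3", "d1", "d5", "d6"], length=4)
print(page, infer_preference(clicked_docs=["d3"], team_a=team_a, team_b=team_b))
```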
Overall, interleaved evaluation is a fruitful area of research with numerous interesting ideas. Unfortunately, in the available time frame, we are not able to cover all of them. Hence, we encourage interested participants to look into other relevant tutorials [27, 52, 54, 56].

5 MACHINE LEARNING DRIVEN A/B TESTING

A large part of our tutorial is devoted to modern and state-of-the-art techniques (including ones based on machine learning) that improve the efficiency of online experiments. We start this section with a comparison of randomized experiments and observational studies. We explain that the difference between averages of the key metric may be misleading when measured in an observational study. We introduce the Neyman–Rubin model and rigorously formulate the implicit assumptions we make each time we evaluate the results of randomized experiments.

Then we overview several studies devoted to variance reduction of evaluation metrics. Regression adjustment techniques such as stratification, linear models [18, 63], and gradient boosted decision trees [51] reduce the variance related to the observed features (covariates) of the users; a simple example of such an adjustment is sketched below.
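The sketch below shows the simplest form of such a regression adjustment, in the spirit of the CUPED technique with pre-experiment data [18]: each user's metric is adjusted by theta times their centered pre-experiment value, which leaves the expected difference unchanged while shrinking the variance. The synthetic data and the pooled estimate of theta are illustrative choices.

```python
# Illustrative sketch of linear regression adjustment with a pre-experiment
# covariate, in the spirit of CUPED [18]; the data and details are made up.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 100_000
# Pre-experiment activity (covariate) correlated with the in-experiment metric.
pre_c = rng.gamma(shape=2.0, scale=5.0, size=n)
pre_t = rng.gamma(shape=2.0, scale=5.0, size=n)
metric_c = 0.8 * pre_c + rng.normal(0, 4, size=n)
metric_t = 0.8 * pre_t + rng.normal(0, 4, size=n) + 0.05   # small true effect

# theta estimated from pooled data: cov(metric, covariate) / var(covariate).
y = np.concatenate([metric_c, metric_t])
x = np.concatenate([pre_c, pre_t])
cov = np.cov(y, x)
theta = cov[0, 1] / cov[1, 1]

adj_c = metric_c - theta * (pre_c - x.mean())   # adjusted metric, control
adj_t = metric_t - theta * (pre_t - x.mean())   # adjusted metric, treatment

print("variance reduction: "
      f"{1 - (adj_c.var() + adj_t.var()) / (metric_c.var() + metric_t.var()):.1%}")
print("p-value before:", stats.ttest_ind(metric_t, metric_c).pvalue,
      "after:", stats.ttest_ind(adj_t, adj_c).pvalue)
```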
We also consider experiments with the user experience where the effect of a service change is heterogeneous (i.e., differs for users of different types). We overview the main approaches to estimation of the heterogeneous treatment effect depending on the user features [3, 50].

We explain the Optimal Distribution Decomposition (ODD) approach, which is based on the analysis of the control and treatment distributions of the key metric as a whole and, for this reason, is sensitive to more ways the two distributions may actually differ [48].

A method of virtually increasing the experiment duration through prediction of future user behavior [22] is discussed. We also present another way to improve sensitivity that is based on learning combinations of metrics [35]; this approach showed outstanding sensitivity improvements in a large-scale empirical evaluation [35].

Finally, we discuss ways to improve the performance of the experimentation pipeline as a whole. Optimal scheduling of online evaluation experiments is presented [37], and approaches for their early stopping are highlighted, where the inflation of Type I error [33] and ways to correctly perform sequential testing [16, 38] are discussed (a small simulation of this inflation effect is sketched below).
We also highlight important topics not covered by the tutorial: Bayesian approaches [13, 16] and the non-parametric mSPRT [1] in sequential testing; network A/B testing [28, 58]; two-stage A/B testing [15] (also mentioned as tournaments in Sec. 2); and imperfect treatment assignment [11].

TUTORIAL MATERIALS

The tutorial materials (slides) are available at https://research.yandex.com/tutorials/online-evaluation/www-2018.

REFERENCES

[1] Vineet Abhishek and Shie Mannor. 2017. A nonparametric sequential test for online randomized experiments. In Proceedings of the 26th International Conference on World Wide Web Companion. International World Wide Web Conferences Steering Committee, 610–616.
[2] Olga Arkhipova, Lidia Grauer, Igor Kuralenok, and Pavel Serdyukov. 2015. Search Engine Evaluation based on Search Engine Switching Prediction. In SIGIR'2015. ACM, 723–726.
[3] Susan Athey and Guido Imbens. 2015. Machine Learning Methods for Estimating Heterogeneous Causal Effects. arXiv preprint arXiv:1504.01132 (2015).
[4] Juliette Aurisset, Michael Ramm, and Joshua Parks. 2017. Innovating Faster on Personalization Algorithms at Netflix Using Interleaving. https://medium.com/netflix-techblog/interleaving-in-online-experiments-at-netflix-a04ee392ec55. (2017).
[5] Eytan Bakshy and Dean Eckles. 2013. Uncertainty in online experiments with dependent data: An evaluation of bootstrap methods. In KDD'2013. 1303–1311.
[6] Paolo Boldi, Francesco Bonchi, Carlos Castillo, and Sebastiano Vigna. 2009. From Dango to Japanese cakes: Query reformulation models and patterns. In Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology - Volume 01. IEEE Computer Society, 183–190.
[7] Roman Budylin, Alexey Drutsa, Ilya Katsev, and Valeriya Tsoy. 2018. Consistent Transformation of Ratio Metrics for Efficient Online Controlled Experiments. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 55–63.
[8] Sunandan Chakraborty, Filip Radlinski, Milad Shokouhi, and Paul Baecke. 2014. On correlation of absence time and search effectiveness. In SIGIR'2014. 1163–1166.
[9] Olivier Chapelle, Thorsten Joachims, Filip Radlinski, and Yisong Yue. 2012. Large-scale validation and analysis of interleaved search evaluation. ACM Transactions on Information Systems (TOIS) 30, 1 (2012), 6.
[10] Shuchi Chawla, Jason Hartline, and Denis Nekipelov. 2016. A/B testing of auctions. In EC'2016.
[11] Dominic Coey and Michael Bailey. 2016. People and cookies: Imperfect treatment assignment in online experiments. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 1103–1111.
[12] Thomas Crook, Brian Frasca, Ron Kohavi, and Roger Longbotham. 2009. Seven pitfalls to avoid when running controlled experiments on the web. In KDD'2009. 1105–1114.
[13] Alex Deng. 2015. Objective Bayesian Two Sample Hypothesis Testing for Online Controlled Experiments. In WWW'2015 Companion. 923–928.
[14] Alex Deng and Victor Hu. 2015. Diluted Treatment Effect Estimation for Trigger Analysis in Online Controlled Experiments. In WSDM'2015. 349–358.
[15] Alex Deng, Tianxi Li, and Yu Guo. 2014. Statistical inference in two-stage online controlled experiments with treatment selection and validation. In WWW'2014. 609–618.
[16] Alex Deng, Jiannan Lu, and Shouyuan Chen. 2016. Continuous Monitoring of A/B Tests without Pain: Optional Stopping in Bayesian Testing. In DSAA'2016.
[17] Alex Deng and Xiaolin Shi. 2016. Data-Driven Metric Development for Online Controlled Experiments: Seven Lessons Learned. In KDD'2016.
[18] Alex Deng, Ya Xu, Ron Kohavi, and Toby Walker. 2013. Improving the sensitivity of online controlled experiments by utilizing pre-experiment data. In WSDM'2013. 123–132.
[19] Pavel Dmitriev and Xian Wu. 2016. Measuring Metrics. In CIKM'2016. 429–437.
[20] Alexey Drutsa. 2015. Sign-Aware Periodicity Metrics of User Engagement for Online Search Quality Evaluation. In SIGIR'2015. 779–782.
[21] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2015. Engagement Periodicity in Search Engine Usage: Analysis and Its Application to Search Quality Evaluation. In WSDM'2015. 27–36.
[22] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2015. Future User Engagement Prediction and its Application to Improve the Sensitivity of Online Experiments. In WWW'2015. 256–266.
[23] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017. Periodicity in User Engagement with a Search Engine and its Application to Online Controlled Experiments. ACM Transactions on the Web (TWEB) 11 (2017).
[24] Alexey Drutsa, Gleb Gusev, and Pavel Serdyukov. 2017. Using the Delay in a Treatment Effect to Improve Sensitivity and Preserve Directionality of Engagement Metrics in A/B Experiments. In WWW'2017.
[25] Alexey Drutsa, Anna Ufliand, and Gleb Gusev. 2015. Practical Aspects of Sensitivity in Online Experimentation with User Engagement Metrics. In CIKM'2015. 763–772.
[26] Bradley Efron and Robert J Tibshirani. 1994. An introduction to the bootstrap. CRC press.
[27] Artem Grotov and Maarten de Rijke. 2016. Online learning to rank for information retrieval: Tutorial. In SIGIR.
[28] Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.
[29] Katja Hofmann, Shimon Whiteson, and Maarten De Rijke. 2011. A probabilistic method for inferring preferences from clicks. In Proceedings of the 20th ACM international conference on Information and knowledge management. ACM, 249–258.
[30] Henning Hohnhold, Deirdre O'Brien, and Diane Tang. 2015. Focusing on the Long-term: It's Good for Users and Business. In KDD'2015. 1849–1858.
[31] Thorsten Joachims. 2002. Unbiased evaluation of retrieval quality using clickthrough data. (2002).
[32] Thorsten Joachims et al. 2003. Evaluating Retrieval Performance Using Clickthrough Data. (2003).
[33] Ramesh Johari, Pete Koomen, Leonid Pekelis, and David Walsh. 2017. Peeking at A/B Tests: Why it matters, and what to do about it. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1517–1525.
[34] Rosie Jones and Kristina Lisa Klinkner. 2008. Beyond the session timeout: automatic hierarchical segmentation of search topics in query logs. In Proceedings of the 17th ACM conference on Information and knowledge management. ACM, 699–708.
[35] Eugene Kharitonov, Alexey Drutsa, and Pavel Serdyukov. 2017. Learning Sensitive Combinations of A/B Test Metrics. In WSDM'2017.
[36] Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015. Generalized Team Draft Interleaving. In CIKM'2015.
[37] Eugene Kharitonov, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015. Optimised Scheduling of Online Experiments. In SIGIR'2015. 453–462.
[38] Eugene Kharitonov, Aleksandr Vorobev, Craig Macdonald, Pavel Serdyukov, and Iadh Ounis. 2015. Sequential Testing for Early Stopping of Online Experiments. In SIGIR'2015. 473–482.
[39] Youngho Kim, Ahmed Hassan, Ryen W White, and Imed Zitouni. 2014. Modeling dwell time to predict click-level satisfaction. In Proceedings of the 7th ACM international conference on Web search and data mining. ACM, 193–202.
[40] Ronny Kohavi, Thomas Crook, Roger Longbotham, Brian Frasca, Randy Henne, Juan Lavista Ferres, and Tamir Melamed. 2009. Online experimentation at Microsoft. Data Mining Case Studies (2009), 11.
[41] Ron Kohavi, Alex Deng, Brian Frasca, Roger Longbotham, Toby Walker, and Ya Xu. 2012. Trustworthy online controlled experiments: Five puzzling outcomes explained. In KDD'2012. 786–794.
[42] Ron Kohavi, Alex Deng, Brian Frasca, Toby Walker, Ya Xu, and Nils Pohlmann. 2013. Online controlled experiments at large scale. In KDD'2013. 1168–1176.
[43] R. Kohavi, A. Deng, R. Longbotham, and Y. Xu. 2014. Seven Rules of Thumb for Web Site Experimenters. In KDD'2014.
[44] Ron Kohavi, Randal M Henne, and Dan Sommerfield. 2007. Practical guide to controlled experiments on the web: listen to your customers not to the hippo. In KDD'2007. 959–967.
[45] Ron Kohavi, Roger Longbotham, Dan Sommerfield, and Randal M Henne. 2009. Controlled experiments on the web: survey and practical guide. Data Min. Knowl. Discov. 18, 1 (2009), 140–181.

[46] Ron Kohavi, David Messner, Seth Eliot, Juan Lavista Ferres, Randy Henne, Vignesh Kannappan, and Justin Wang. 2010. Tracking Users' Clicks and Submits: Tradeoffs between User Experience and Data Loss. (2010).
[47] Mounia Lalmas, Heather O'Brien, and Elad Yom-Tov. 2014. Measuring user engagement. Synthesis Lectures on Information Concepts, Retrieval, and Services 6, 4 (2014), 1–132.
[48] Kirill Nikolaev, Alexey Drutsa, Ekaterina Gladkikh, Alexander Ulianov, Gleb Gusev, and Pavel Serdyukov. 2015. Extreme States Distribution Decomposition Method for Search Engine Online Evaluation. In KDD'2015. 845–854.
[49] Eric T Peterson. 2004. Web analytics demystified: a marketer's guide to understanding how your web site affects your business. Ingram.
[50] Scott Powers, Junyang Qian, Kenneth Jung, Alejandro Schuler, Nigam H Shah, Trevor Hastie, and Robert Tibshirani. 2017. Some methods for heterogeneous treatment effect estimation in high-dimensions. arXiv preprint arXiv:1707.00102 (2017).
[51] Alexey Poyarkov, Alexey Drutsa, Andrey Khalyavin, Gleb Gusev, and Pavel Serdyukov. 2016. Boosted Decision Tree Regression Adjustment for Variance Reduction in Online Controlled Experiments. In KDD'2016. 235–244.
[52] Filip Radlinski. 2013. Sensitive Online Search Evaluation. http://irsg.bcs.org/SearchSolutions/2013/presentations/radlinski.pdf. (2013).
[53] Filip Radlinski and Nick Craswell. 2013. Optimized interleaving for online retrieval evaluation. In WSDM.
[54] Filip Radlinski and Katja Hofmann. 2013. Practical online retrieval evaluation. In ECIR.
[55] Filip Radlinski, Madhu Kurup, and Thorsten Joachims. 2008. How does clickthrough data reflect retrieval quality?. In CIKM'2008. 43–52.
[56] Filip Radlinski and Yisong Yue. 2011. Practical Online Retrieval Evaluation. In SIGIR.
[57] Kerry Rodden, Hilary Hutchinson, and Xin Fu. 2010. Measuring the user experience on a large scale: user-centered metrics for web applications. In CHI'2010. 2395–2398.
[58] Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M Airoldi. 2017. Detecting network effects: Randomizing over randomized experiments. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. ACM, 1027–1035.
[59] Anne Schuth, Floor Sietsma, Shimon Whiteson, Damien Lefortier, and Maarten de Rijke. 2014. Multileaved comparisons for fast online evaluation. In CIKM.
[60] Milad Shokouhi. 2011. Detecting seasonal queries by time-series analysis. In Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 1171–1172.
[61] Yang Song, Xiaolin Shi, and Xin Fu. 2013. Evaluating and predicting user engagement change with degraded search relevance. In WWW'2013. 1213–1224.
[62] Diane Tang, Ashish Agarwal, Deirdre O'Brien, and Mike Meyer. 2010. Overlapping experiment infrastructure: More, better, faster experimentation. In KDD'2010. 17–26.
[63] Huizhi Xie and Juliette Aurisset. 2016. Improving the Sensitivity of Online Controlled Experiments: Case Studies at Netflix. In KDD'2016.
[64] Ya Xu and Nanyu Chen. 2016. Evaluating Mobile Apps with A/B and Quasi A/B Tests. In KDD'2016.
[65] Ya Xu, Nanyu Chen, Addrian Fernandez, Omar Sinno, and Anmol Bhasin. 2015. From infrastructure to culture: A/B testing challenges in large scale social networks. In KDD'2015.

