Abstract:
A/B testing, also referred to as online controlled experimentation or continuous experimentation, is a form of hypothesis testing where two variants of a piece of software are compared in the field from an end user's point of view. A/B testing is widely used in practice to enable data-driven decision making for software development. While a few studies have explored different facets of research on A/B testing, no comprehensive study has been conducted on the state of the art in A/B testing. Such a study is crucial to provide a systematic overview of the field of A/B testing, driving future research forward. To address this gap and provide an overview of the state of the art in A/B testing, this paper reports the results of a systematic literature review that analyzed primary studies. The research questions focused on the subject of A/B testing, how A/B tests are designed and executed, what roles stakeholders have in this process, and the open challenges in the area. Analysis of the extracted data shows that the main targets of A/B testing are algorithms, visual elements, and workflow and processes. Single classic A/B tests are the dominant type of test, primarily based on hypothesis tests. Stakeholders have three main roles in the design of A/B tests: concept designer, experiment architect, and setup technician. The primary types of data collected during the execution of A/B tests are product/system data, user-centric data, and spatio-temporal data. The dominant uses of the test results are feature selection, feature rollout, continued feature development, and subsequent A/B test design. Stakeholders have two main roles during A/B test execution: experiment coordinator and experiment assessor. The main reported open problems relate to the enhancement of the proposed approaches and their usability. From our study we derived three interesting lines for future research: strengthening the adoption of statistical methods in A/B testing, improving the process of A/B testing, and enhancing the automation of A/B testing.
Abstract:
Marketers know that running experiments is a proven way to improve results and gain competitive advantage against rivals. Despite this knowledge, and the fact that experiments are now easier to conduct than ever before, data shows that marketers consistently under-experiment. In this article, we examine why this gap exists and what can be done to close it. We do so by connecting with senior-level marketing professionals representing seven consumer-facing industries in two phases. First, through a series of interviews, we gain an initial understanding of the concerns, challenges, and realities of those working in the industry. Following this phase, we surveyed a larger group to corroborate and extend our initial findings, comparing cases to identify challenges and the strategies used to overcome them. We present our findings as a series of experimentation myths before closing with a broader perspective on how organizations can infuse experimentation into their culture. (C) 2022 Kelley School of Business, Indiana University. Published by Elsevier Inc. All rights reserved.
Abstract:
Owing to their physical flexibility and exceptional sensitivity to ultrasonic waves, thin-film graphene-based nanocomposite sensors have been gaining prominence in ultrasonic testing-based structural health monitoring (UT-SHM) applications. However, both the electrical conductivities of this new class of sensors and their adhesion on monitoring targets have been found to be highly dependent on temperature. Consequently, under excessive temperature variations, the output signals would be disturbed and unable to reflect the health conditions of the monitoring targets, undermining the accuracy of the health monitoring. Herein, we propose a high-thermal-stability thin-film graphene/polyamide-imide sensor for acquiring ultrasonic waves under unstable temperature conditions. The sensor consists of three layers, namely a polyamide-imide-based insulation/adhesion layer (bottom), a graphene/polyamide-imide-based sensing layer (middle), and a silver-based electrode layer (top). It is fabricated by ultrasonic atomization-assisted spray coating and can be formed directly on monitoring targets. Thanks to the adoption of polyamide-imide, the sensor retains a steady electrical conductivity and a strong adhesion on monitoring targets up to 160 °C. As a result, its sensitivity to ultrasonic waves exhibits only marginal changes. All in all, this work further promotes the implementation of thin-film graphene-based nanocomposite sensors in real-life UT-SHM applications.
Abstract:
Online controlled experimentation, or A/B testing, has been a standard framework adopted by most online product companies to measure the effect of any new change. Companies use various statistical methods, including hypothesis testing and statistical inference, to quantify the business impact of the changes and make product decisions. Nowadays, experimentation platforms can run as many as hundreds or even more experiments concurrently. When a group of experiments is conducted, usually the ones with significant successful results are chosen to be launched into the product. We are interested in learning the aggregated impact of the launched features. In this paper, we investigate a statistical selection bias in this process and propose a correction method that yields an unbiased estimator. Moreover, we give an implementation example on Airbnb's ERF (Experiment Reporting Framework) platform and discuss best practices to account for this bias.
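To make the selection bias concrete, here is a minimal Python sketch (synthetic parameters assumed, and this is not the paper's correction method): experiments whose estimates cross a significance threshold are "launched", and the naive aggregate of their estimates overstates their true aggregate impact.

```python
# Illustrative sketch only: simulates the selection bias that arises when
# aggregating the measured effects of launched (significant) experiments.
# All numbers below are assumptions chosen for illustration.
import numpy as np

rng = np.random.default_rng(0)

n_experiments = 10_000
true_effects = rng.normal(loc=0.0, scale=0.5, size=n_experiments)  # true lifts
std_error = 1.0                                                    # per-experiment standard error
estimates = true_effects + rng.normal(scale=std_error, size=n_experiments)

# Launch rule: keep experiments whose estimate is significantly positive (z > 1.96).
launched = estimates / std_error > 1.96

naive_aggregate = estimates[launched].mean()    # what summing up the winners reports
true_aggregate = true_effects[launched].mean()  # actual impact of the launched features

print(f"launched: {launched.sum()} of {n_experiments}")
print(f"naive average effect of launched features: {naive_aggregate:.3f}")
print(f"true average effect of launched features:  {true_aggregate:.3f}")
# The naive estimate is biased upward because launching conditions on noise.
```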
Abstract:
This work presents an original broadband method able to determine the homogenized mechanical properties of orthotropic composite plates in their main directions. This method is based on a vibratory inverse method and does not require using eigenmodes. This technique is derived from the Force Analysis Technique (FAT), originally designed to identify vibration sources on bars and plates. The main advantage of this method is that it does not require any specific preparation and can be carried out on a real and complex structure. Moreover, it is a non-destructive and non-invasive method that only requires excitation by a classical electromagnetic shaker. The main characteristic mechanical parameters (Young's moduli and damping loss factors) are identified at all measured frequencies, not only at the eigenfrequencies as in other measurement techniques. An experimental validation shows a good correlation with the expected values for the Young's moduli of the test plate. Working on a complex structure shows that the method is applicable for various configurations of operating conditions and leads to a good determination of the mechanical properties of a planar component.
Abstract:
The mechanical properties of high-quality cubic boron nitride (cBN) films were systematically investigated by nanoindentation measurements performed in both cross-sectional and plan-view directions. The large film thickness (~5 μm) effectively rules out both substrate and indenter size effects. The hardness and elastic modulus values were found to be 70 and 800 GPa, respectively, which are the highest values obtained so far on cBN films deposited by either PVD or CVD methods (comparable to those reported for cBN crystals synthesized by high-pressure high-temperature methods). The variation of hardness across the cBN film thickness was investigated. In conjunction with transmission electron microscopy observations, the relationship between the measured hardness and the crystallinity and crystal size/grain boundaries was discussed.
Abstract:
Isotropic and kinematic hardening is combined with the local, average lattice rotation in order to describe the hardening behaviour of a polycrystalline metal in terms of continuum mechanics. This approach is adapted by means of parameter identification to experimental results by Boehler and Koss (In: Bruller O., Mannl V., Najar J. (Eds.), Advances in Continuum Mechanics. Springer, Berlin, (1991) pp. 143) and Koss (Écrouissage isotrope et anisotrope des aciers laminés soumis à de grandes déformations.
Abstract:
Online search evaluation, and A/B testing in particular, is an irreplaceable
tool for modern search engines. Typically, online experiments
last for several days or weeks and require a considerable
portion of the search traffic. Despite the increasing need for running
more experiments, the amount of that traffic is limited. This
situation leads to the problem of finding new key performance
metrics with higher sensitivity and lower variance. Recently, we
proposed a number of techniques to alleviate this need for larger
sample sizes in A/B experiments.
One approach was based on formulating the quest for finding
a sensitive metric as a data-driven machine learning problem of
finding a sensitive metric combination [2]. We assumed that each
single observation in these experiments is assigned a vector
of metrics (features) describing it. After that, we learned a linear
combination of these metrics, such that the learned combination can
be considered a metric itself that (a) agrees with the preference
direction in the seed experiments according to a baseline ground-truth
metric, and (b) achieves a higher sensitivity than the baseline
ground-truth metric.
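A minimal sketch of this idea, under assumed synthetic data and a simplified objective rather than the exact formulation in [2]: weights of a linear metric combination are optimized so that the combined metric agrees in sign with the known direction of each seed experiment while its t-statistic (sensitivity) is rewarded.

```python
# Sketch: learn a linear combination of metrics that agrees with the known
# preference direction of seed experiments and is more sensitive than any
# single metric. Synthetic data and the penalized objective are assumptions.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(1)

def simulate_experiment(direction, n=2000, n_metrics=3):
    """Per-user metric vectors for control and treatment of one seed experiment."""
    control = rng.normal(size=(n, n_metrics))
    shift = direction * np.array([0.03, 0.05, 0.01])  # assumed per-metric movement
    treatment = rng.normal(size=(n, n_metrics)) + shift
    return control, treatment

# Seed experiments with known ground-truth preference direction (+1 or -1).
seeds = [(d, *simulate_experiment(d)) for d in (+1, +1, -1)]

def t_stat(w, control, treatment):
    """t-statistic of the combined metric w.x between treatment and control."""
    return stats.ttest_ind(treatment @ w, control @ w, equal_var=False).statistic

def objective(w):
    # Reward sensitivity in the known direction, penalize disagreement.
    total = 0.0
    for direction, control, treatment in seeds:
        t = t_stat(w, control, treatment)
        total += direction * t if np.sign(t) == direction else -abs(t)
    return -total  # minimize the negative of the reward

result = optimize.minimize(objective, x0=np.ones(3) / 3, method="Nelder-Mead")
w_learned = result.x / np.linalg.norm(result.x)
print("learned metric weights:", np.round(w_learned, 3))
```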
Another approach addressed the problem of delays in the treatment
effects, which cause low sensitivity of the metrics and require
conducting A/B experiments with a longer duration or a larger set of
users drawn from the limited traffic [1]. We found that a delayed treatment
effect of a metric could be revealed through the daily time series
of the metric's measurements over the days of an A/B test. So,
we proposed several metrics that learn the models of the trend in
such time series and use them to quantify the changes in the user
behavior.
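An illustrative sketch, assuming synthetic data and a simple linear trend rather than the specific models proposed in [1]: the slope fitted to the daily treatment-minus-control deltas can flag a delayed effect that the plain average over the test period dilutes.

```python
# Sketch: quantify a delayed treatment effect from the daily time series of
# treatment-minus-control deltas. Synthetic data; a linear trend is assumed.
import numpy as np

rng = np.random.default_rng(2)

days = np.arange(14)
# Assumed delayed effect: no lift during the first week, growing lift afterwards.
true_delta = np.where(days < 7, 0.0, 0.02 * (days - 6))
observed_delta = true_delta + rng.normal(scale=0.01, size=days.size)

slope, intercept = np.polyfit(days, observed_delta, deg=1)

print(f"average delta over the whole test: {observed_delta.mean():+.4f}")
print(f"fitted daily trend (slope):        {slope:+.4f}")
# A clearly positive slope flags a growing (delayed) effect even when the
# plain average over all days is still small.
```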
Finally, in another study [3], we addressed the problem of variance
reduction for user engagement metrics and developed a general
framework that allows us to incorporate both the existing
state-of-the-art approaches to reduce the variance and some novel
ones based on advanced machine learning techniques. The expected
value of the key metric for a given user consists of two components:
(1) the expected value for this user irrespective of the treatment assignment,
and (2) the treatment effect for this user. The expectation
of the first component does not depend on the treatment assignment
and does not contribute to the actual average treatment effect, but
may increase the variance of its estimation. If we knew the value of the first component, we would subtract it from the key metric and
obtain a new metric with decreased variance. However, since we
cannot evaluate the first component exactly, we propose to predict
it based on the attributes of the user that are independent of the
treatment exposure. Therefore, we propose to utilize, instead of
the average value of a key metric, its average deviation from its
predicted value. In this way, the problem of variance reduction is
reduced to the problem of finding the best predictor for the key
metric that is not aware of the treatment exposure. In our general
approach, we apply gradient boosted decision trees and achieve a
significantly greater variance reduction than the state-of-the-art.
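A minimal sketch of this variance-reduction idea, assuming synthetic data and scikit-learn's GradientBoostingRegressor as the predictor (the exact setup in [3] may differ): the key metric is replaced by its deviation from a prediction based on pre-treatment user attributes, which preserves the effect estimate while shrinking its standard error.

```python
# Sketch: variance reduction by replacing the raw key metric with its deviation
# from a prediction based on pre-treatment user attributes.
# Synthetic data and the specific regressor are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
n_users = 20_000

X = rng.normal(size=(n_users, 5))                         # pre-treatment attributes
baseline = 2.0 + X[:, 0] + 0.5 * X[:, 1] ** 2             # user-specific component
treated = rng.integers(0, 2, size=n_users).astype(bool)   # random assignment
true_effect = 0.05                                         # assumed lift
metric = baseline + true_effect * treated + rng.normal(scale=1.0, size=n_users)

# The predictor sees only pre-treatment attributes, never the treatment assignment.
model = GradientBoostingRegressor(max_depth=3, n_estimators=200)
model.fit(X, metric)
adjusted = metric - model.predict(X)

def diff_and_se(y):
    """Difference in means between treatment and control, with its standard error."""
    d = y[treated].mean() - y[~treated].mean()
    se = np.sqrt(y[treated].var(ddof=1) / treated.sum()
                 + y[~treated].var(ddof=1) / (~treated).sum())
    return d, se

for name, y in (("raw metric", metric), ("adjusted metric", adjusted)):
    d, se = diff_and_se(y)
    print(f"{name:16s} effect estimate {d:+.4f}   std. error {se:.4f}")
# The adjusted metric keeps roughly the same effect estimate but has a smaller
# standard error, i.e. the same A/B test becomes more sensitive.
```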
Abstract:
Many online experiments exhibit dependence between users and items. For example, in online advertising, observations that have a user or an ad in common are likely to be associated. Because of this, even in experiments involving millions of subjects, the difference in mean outcomes between control and treatment conditions can have substantial variance. Previous theoretical and simulation results demonstrate that not accounting for this kind of dependence structure can result in confidence intervals that are too narrow, leading to inaccurate hypothesis tests. We develop a framework for understanding how dependence affects uncertainty in user-item experiments and evaluate how bootstrap methods that account for differing levels of dependence perform in practice. We use three real datasets describing user behaviors on Facebook (user responses to ads, search results, and News Feed stories) to generate data for synthetic experiments in which there is no effect of the treatment on average by design. We then estimate empirical Type I error rates for each bootstrap method. Accounting for dependence within a single type of unit (i.e., within-user dependence) is often sufficient to get reasonable error rates. But when experiments have effects, as one might expect in the field, accounting for multiple units with a multiway bootstrap can be necessary to get close to the advertised Type I error rates. This work provides guidance to practitioners evaluating large-scale experiments, and highlights the importance of analyzing inferential methods for dependence structures common to online systems.
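An illustrative sketch of a user-item (multiway) bootstrap, assuming synthetic data with per-user treatment assignment and Poisson resampling weights (not necessarily the exact procedure evaluated in the paper): each replicate draws independent weights per user and per item and reweights every observation by their product.

```python
# Sketch: a user-item (multiway) bootstrap for the treatment-minus-control
# difference in means when observations share users and items.
# Synthetic data and Poisson resampling weights are assumptions.
import numpy as np

rng = np.random.default_rng(4)
n_users, n_items, n_obs = 500, 200, 20_000

users = rng.integers(n_users, size=n_obs)
items = rng.integers(n_items, size=n_obs)
treated_user = rng.integers(0, 2, size=n_users).astype(bool)  # per-user assignment
treated = treated_user[users]

# Outcomes with shared user and item random effects (the source of dependence).
user_effect = rng.normal(scale=1.0, size=n_users)
item_effect = rng.normal(scale=1.0, size=n_items)
y = user_effect[users] + item_effect[items] + rng.normal(scale=1.0, size=n_obs)

def weighted_diff(w):
    """Weighted difference in mean outcome between treatment and control."""
    t = np.average(y[treated], weights=w[treated])
    c = np.average(y[~treated], weights=w[~treated])
    return t - c

replicates = []
for _ in range(500):
    w_user = rng.poisson(1.0, size=n_users)[users]  # one weight per user
    w_item = rng.poisson(1.0, size=n_items)[items]  # one weight per item
    replicates.append(weighted_diff(w_user * w_item + 1e-9))

naive_se = np.sqrt(y[treated].var(ddof=1) / treated.sum()
                   + y[~treated].var(ddof=1) / (~treated).sum())
print(f"naive i.i.d. std. error:        {naive_se:.4f}")
print(f"multiway bootstrap std. error:  {np.std(replicates):.4f}")
# In this simulation the bootstrap standard error is much larger, reflecting
# the user/item dependence that the naive i.i.d. calculation ignores.
```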
Abstract:
Online controlled experiments (a.k.a. A/B testing) have been used as the mantra for data-driven decision making on feature changes and product shipping in many Internet companies. However, it is still a great challenge to systematically measure how every code or feature change impacts millions of users with great heterogeneity (e.g. countries, ages, devices). The most commonly used A/B testing framework in many companies is based on the Average Treatment Effect (ATE), which cannot detect the heterogeneity of the treatment effect on users with different characteristics. In this paper, we propose statistical methods that can systematically and accurately identify the Heterogeneous Treatment Effect (HTE) of any user cohort of interest (e.g. mobile device type, country), and determine which factors (e.g. age, gender) of users contribute to the heterogeneity of the treatment effect in an A/B test. By applying these methods on both simulation data and real-world experimentation data, we show how they work robustly with a controlled low False Discovery Rate (FDR) and, at the same time, provide useful insights about the heterogeneity of identified user groups. We have deployed a toolkit based on these methods and have used it to measure the Heterogeneous Treatment Effect of many A/B tests at Snap.
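An illustrative sketch in the spirit of the cohort analysis described above, assuming synthetic data and a chosen cohort factor (not Snap's actual toolkit): the treatment effect of each cohort is tested against the overall effect, and the resulting p-values are adjusted with a Benjamini-Hochberg correction to control the false discovery rate.

```python
# Sketch: flag user cohorts whose treatment effect differs from the overall
# effect while controlling the false discovery rate.
# Synthetic data; the cohort factor and effect sizes are assumptions.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(5)
n_users, n_cohorts = 50_000, 10

cohort = rng.integers(n_cohorts, size=n_users)           # e.g. device-type buckets
treated = rng.integers(0, 2, size=n_users).astype(bool)
# Assumed ground truth: only cohorts 0 and 1 carry an extra (heterogeneous) lift.
lift = 0.02 + np.where(cohort < 2, 0.15, 0.0)
y = rng.normal(size=n_users) + lift * treated

overall_effect = y[treated].mean() - y[~treated].mean()

p_values, effects = [], []
for c in range(n_cohorts):
    m = cohort == c
    t_grp, c_grp = y[m & treated], y[m & ~treated]
    effect = t_grp.mean() - c_grp.mean()
    se = np.sqrt(t_grp.var(ddof=1) / t_grp.size + c_grp.var(ddof=1) / c_grp.size)
    # Approximate z-test of the cohort effect against the overall effect
    # (ignores the small overlap between the cohort and the overall sample).
    z = (effect - overall_effect) / se
    p_values.append(2 * stats.norm.sf(abs(z)))
    effects.append(effect)

flagged, p_adj, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for c in range(n_cohorts):
    print(f"cohort {c}: effect {effects[c]:+.3f}  adj. p {p_adj[c]:.3f}  "
          f"heterogeneous: {flagged[c]}")
```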