Concepts • A/B Testing
 Overview of A/B Testing
 Benefits of A/B Testing
 Delving into Metrics
 Incorporating Sampling Methods in A/B Testing
 Designing an A/B Test: Summary
 Key terms
 Running an A/B Test: The Process
 Switchback Experiments
 Conclusion
 From Statistical to Causal Learning
 References
 Further Reading
Overview of A/B Testing
 Experimentation, particularly A/B testing, offers a strategic method to make informed decisions about product and service improvements. Instead of relying solely on expert opinions or executive decisions, A/B testing provides a platform for users or customers to express their preferences through their interactions.
 The Scientific Approach: A/B testing operates on the scientific method. It starts with forming hypotheses, collecting empirical data through experiments, and then drawing conclusions based on evidence. This method promotes a cyclical pattern of deduction and induction, propelling further hypotheses and tests.
Benefits of A/B Testing
 Gives Users a Voice: A/B testing ensures that user feedback is central to product changes, emphasizing the importance of user experience in decision-making.
 Facilitates Causal Inferences: By randomly assigning users to different groups, and introducing changes only to specific groups, it’s possible to determine the causal effects of a change. This approach isolates the introduced variables, making results more reliable.
 Confidence in Decisions: By gathering concrete data on user preferences, companies can confidently implement changes, knowing they are backed by real user feedback.
Delving into Metrics

Tailored Metrics: The metrics chosen for evaluation vary depending on the experiment. For instance, a test focused on improving search results might measure user engagement with search functions. Other experiments could assess technical performance, like app loading times or video streaming quality under varied conditions.

Interpreting Metrics Carefully: Surface metrics, like click-through rates, might not always give a comprehensive view. For example, increased clicks might only indicate users trying to understand a feature rather than genuine interest. Hence, a multi-metric approach can provide a holistic view, gauging overall user satisfaction and engagement levels.
Incorporating Sampling Methods in A/B Testing
 Sampling methods significantly influence the reliability and relevance of A/B testing outcomes. The choice of sampling method hinges on various factors, including data quality, precision requirements, and cost considerations.
Simple Random Sampling
 Simple Random Sampling (SRS) ensures every potential sample stands an equal chance of selection. While this method can mitigate bias, it might sometimes fail to represent the population adequately.
Systematic Sampling
 Systematic Sampling involves periodic selection from an ordered list. This approach can efficiently represent populations, but it’s essential to be wary of biases that might arise due to the structure of the data.
Stratified Sampling
 In scenarios where the testing population spans distinct categories, Stratified Sampling becomes relevant. By dividing the population into individual strata and sampling each independently, this method can provide richer insights, more accurate statistical estimates, and flexibility.
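As an illustrative sketch, stratified sampling can be implemented by grouping units into strata and sampling each one independently at the same rate. The mobile/desktop population below is hypothetical:

```python
import random

def stratified_sample(population, strata_key, fraction, seed=42):
    """Sample the same fraction from each stratum independently."""
    rng = random.Random(seed)
    # Group units by stratum (e.g., platform or country).
    strata = {}
    for unit in population:
        strata.setdefault(strata_key(unit), []).append(unit)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population: 90% mobile users, 10% desktop users.
users = [{"id": i, "platform": "mobile" if i < 900 else "desktop"} for i in range(1000)]
sample = stratified_sample(users, lambda u: u["platform"], fraction=0.1)
print(len(sample))  # 100 users, with the small desktop stratum guaranteed representation
```

With simple random sampling, the 10% desktop minority could by chance be under- or over-represented; stratifying fixes its share of the sample exactly.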
Designing an A/B Test: Summary
 Statistical Considerations in A/B Testing
 Understanding the 5% Significance Level:
 The standard practice in A/B testing is to set the acceptable false positive rate at 5%. This means that even if there is no actual difference between treatment and control groups, the test will indicate a significant difference 5% of the time.
 This convention is akin to a classifier mistaking non-cat photos for cat photos 5% of the time.
 The concept of the 5% false positive rate is deeply connected to “statistical significance” and the p-value. The p-value is the probability of observing a result at least as extreme as the one from the A/B test, under the assumption that there’s no real difference between the groups.
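To make the p-value concrete, here is a minimal sketch of a pooled two-proportion z-test in plain Python; the conversion counts are hypothetical:

```python
import math

def two_proportion_pvalue(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in conversion rates (pooled z-test)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal survival function.
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical data: 200/1000 control vs 250/1000 treatment conversions.
p = two_proportion_pvalue(200, 1000, 250, 1000)
print(round(p, 4))  # well under 0.05 → significant at the conventional level
```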
 Reasons for the 5% Standard in A/B Testing:
 Balancing Errors:
 Statistical testing involves balancing Type I (false positives) and Type II (false negatives) errors. The 5% rate ensures a manageable balance between these errors.
 Tradition and Convention:
 The 5% significance level has become an accepted threshold across many research fields, primarily because of historical reasons and its practical implications.
 Risk Tolerance:
 Using a 5% significance level is essentially expressing comfort with a 5% chance of making a false positive error.
 Historical Precedence:
 Influential statisticians, notably Ronald A. Fisher, have advocated for this 5% level, and it has since become a norm.
 Practical Considerations:
 Setting the threshold too low could result in overlooking genuine effects, while setting it too high could lead to many false discoveries.
 Statistical Decision Framework Using Coin Flips:
 Objective and Experiment:
 The goal is to determine if a coin is unfair. This is analogous to business situations where one wishes to see if a new feature changes user behavior.
 The decision framework helps decide when there’s enough evidence to reject the assumption that the coin is fair.
 Decision Making with the p-value:
 A p-value less than 0.05 would typically be considered strong evidence against the fairness of the coin.
 False Positives and Confidence Intervals:
 The rejection region consists of outcomes that are deemed too extreme under the null hypothesis, leading to its rejection.
 Confidence intervals, in contrast, are ranges within which the true parameter value is believed to lie with a certain level of confidence.
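The coin-flip framework above can be sketched directly with an exact binomial test in plain Python; the 60-heads outcome is a made-up example:

```python
from math import comb

def binomial_two_sided_p(heads, flips, p_fair=0.5):
    """Exact two-sided p-value: total probability of outcomes at least as
    extreme as the observed head count, assuming the coin is fair (the null)."""
    observed = abs(heads - flips * p_fair)
    total = 0.0
    for k in range(flips + 1):
        if abs(k - flips * p_fair) >= observed:
            total += comb(flips, k) * p_fair**k * (1 - p_fair) ** (flips - k)
    return total

# Hypothetical experiment: 60 heads in 100 flips.
p = binomial_two_sided_p(60, 100)
print(round(p, 3))  # ≈ 0.057: just shy of rejection at the 5% level
```

Here 60/100 heads falls outside a p < 0.05 rejection, while 61/100 would cross the threshold, illustrating how the rejection region is a sharp cutoff on outcomes deemed too extreme under the null.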
 Understanding False Negatives and Power:
 Power refers to the probability of correctly detecting a genuine effect. It’s connected to the concept of false negatives and is expressed as 1 minus the false negative rate.
 Improving Power in A/B Tests:
 Power can be enhanced by:
 Effect Size: Larger differences between groups are easier to detect.
 Sample Size: More data helps in achieving a more accurate representation of the population.
 Reducing Variability: Making metrics more consistent within groups makes true effects easier to spot.
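These three levers come together in the standard sample-size calculation. A rough sketch under the usual normal approximation; the baseline rate and minimum detectable effect are hypothetical:

```python
import math

def sample_size_per_group(p_base, mde, z_alpha=1.96, z_beta=0.84):
    """Approximate n per group for a two-proportion test (normal approximation).

    p_base: baseline conversion rate; mde: minimum detectable effect (absolute).
    z_alpha and z_beta default to a two-sided 5% level and 80% power.
    """
    p_new = p_base + mde
    var = p_base * (1 - p_base) + p_new * (1 - p_new)
    return math.ceil((z_alpha + z_beta) ** 2 * var / mde**2)

# Hypothetical goal: detect a 2-point lift from a 10% baseline.
print(sample_size_per_group(0.10, 0.02))  # ~3,834 users per group with these rounded z-values
```

Note how the effect size enters squared in the denominator: halving the detectable effect roughly quadruples the required sample.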
 A/B tests, when grounded in statistical principles, provide valuable insights. While they offer rigorous evidence about variations in treatment effects, these insights often form just a piece of a broader decisionmaking process.
Key terms
1. Null Hypothesis (H₀):
 The null hypothesis is a central concept in statistical hypothesis testing. It posits that there’s no significant difference between specified populations or that a particular parameter is equal to a specific value. In A/B testing, H₀ typically states that there’s no difference between treatment and control groups. It’s the default assumption, and statistical tests aim to challenge and possibly reject it.
2. RCT (Randomized Controlled Trial):
 RCTs are experiments used to assess the efficacy of interventions. Participants are randomly assigned to treatment (experimental) or control groups to ensure that any observed effects can be attributed to the intervention itself and not other external factors.
3. Controlling for Confounders:
 A confounder is a third factor in an experiment that could cause variations. If not controlled, confounders can lead to incorrect conclusions about the relationship between the independent and dependent variables. By controlling these confounders, researchers can be more confident that the results observed are due to the treatment itself.
4. Instrumental Variables:
 In situations where direct experimentation isn’t possible due to ethical or practical reasons, instrumental variables can be used. They are variables that don’t directly affect the outcome but are related to the cause of interest and can help determine causality.
5. Treatment and Control:
 In A/B testing or experiments:
 Treatment Group: Receives the intervention or change being tested.
 Control Group: Continues without any intervention, serving as a baseline for comparison.
6. Bias:
 Bias refers to any systematic error in an experiment or study that skews the results. In A/B testing, it’s crucial to identify and minimize biases to ensure the test’s conclusions are valid.
7. Alpha:
 In hypothesis testing, alpha is the probability of rejecting the null hypothesis when it is actually true. This is known as a Type I error. Typically, an alpha of 0.05 is used, meaning there’s a 5% chance of falsely rejecting the null hypothesis.
8. Beta:
 Beta represents the probability of failing to reject the null hypothesis when it is false. This is termed a Type II error. The power of a test is 1 minus beta, representing the probability of rejecting the null hypothesis when it’s false.
9. Power:
 Power measures the ability of a test to detect an effect if one truly exists. It’s the probability that a test will reject the null hypothesis when it should (i.e., when the alternative hypothesis is true). A high-powered test is more likely to detect significant differences.
Effect Size:
 Effect size is a measure of the magnitude of a phenomenon or intervention’s effect. While p-values tell us if an effect exists, effect size gives us the magnitude of that effect. It’s particularly important because a statistically significant result doesn’t necessarily mean it’s practically significant. Common measures include Cohen’s d for t-tests and the odds ratio for chi-squared tests.
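A minimal sketch of Cohen’s d in plain Python; the session-length numbers are invented for illustration:

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d: difference in means scaled by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2))
    return (mean_b - mean_a) / pooled_sd

# Hypothetical session lengths (minutes) for control and treatment users.
control = [5.1, 4.8, 6.0, 5.5, 4.9, 5.2]
treatment = [5.9, 6.1, 5.7, 6.4, 5.8, 6.2]
d = cohens_d(control, treatment)
print(round(d, 2))  # a large effect (|d| > 0.8) on this made-up data
```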
P-value:
 The p-value measures the strength of the evidence against a null hypothesis. A smaller p-value indicates stronger evidence against the null hypothesis, leading researchers to reject it. However, it’s worth noting that a p-value doesn’t provide the probability that either hypothesis is true; it merely indicates the strength of evidence against H₀.
Confidence Intervals (CI):
 A confidence interval gives an estimated range of values that’s likely to include an unknown population parameter. If you have a 95% CI of (2, 8), it means that if you were to repeat the experiment many times and compute an interval each time, about 95% of those intervals would contain the true parameter value. It’s a way of expressing the reliability of an estimate.
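This repeated-experiment interpretation can be checked by simulation. A sketch assuming a hypothetical 30% true conversion rate and simple Wald intervals:

```python
import math
import random

def coverage_simulation(true_p=0.3, n=500, trials=1000, seed=7):
    """Fraction of 95% Wald intervals that contain the true conversion rate."""
    rng = random.Random(seed)
    covered = 0
    for _ in range(trials):
        # Simulate one experiment: n users, each converting with probability true_p.
        conversions = sum(rng.random() < true_p for _ in range(n))
        p_hat = conversions / n
        half = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - half <= true_p <= p_hat + half:
            covered += 1
    return covered / trials

print(coverage_simulation())  # close to 0.95: ~95% of intervals cover true_p
```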
External vs. Internal Validity:

External Validity: Refers to the extent to which the results of a study generalize to settings, people, times, and measures other than the ones used in the study.

Internal Validity: Concerns whether the intervention (and not other factors) caused the observed effect. High internal validity is often achieved by controlling for confounders, randomizing participants, and using proper blinding techniques.
FDR (False Discovery Rate):
 In multiple hypothesis testing, the FDR controls the expected proportion of false positives. It’s a method used to correct for multiple comparisons and is especially important in fields like genomics where thousands of hypotheses are tested simultaneously.
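One common FDR correction is the Benjamini-Hochberg procedure. A sketch with hypothetical p-values from five simultaneous metric comparisons:

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the indices of hypotheses rejected at FDR level q (BH procedure)."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= (k/m) * q, then
    # reject the k hypotheses with the smallest p-values.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    return sorted(order[:k_max])

# Hypothetical p-values from five simultaneous tests.
pvals = [0.001, 0.011, 0.039, 0.041, 0.60]
print(benjamini_hochberg(pvals))  # [0, 1]: BH rejects two; Bonferroni (0.05/5) would reject only one
```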
Running an A/B Test: The Process
 Here’s a step-by-step breakdown:
 Planning: Before launching the test, you’ll decide on what change you’re going to test (e.g., a new feature, a different user interface, etc.), determine the success metrics (e.g., click-through rate, conversion rate), and determine the sample size needed for adequate statistical power.
 Allocation:
 Before Launch: You’ll decide on the method of allocation (e.g., random assignment, stratified sampling, etc.) and the proportion of users to allocate to each group (e.g., 50% to A and 50% to B, or maybe 90% to A and 10% to B if you’re doing a more cautious rollout). You’ll also set up the infrastructure to assign users to groups.
 During Launch: When the test is actually launched, users will be dynamically allocated to one of the groups based on the predetermined methodology. For example, if it’s random allocation, each user entering the site might have a 50% chance of being in group A and a 50% chance of being in group B.
 Note that the strategy and methodology for allocation are decided before the test is launched, but the actual allocation of individual users to groups occurs dynamically as part of the test execution as users interact with the platform during the test period.
 Execution: Once users are allocated, they’ll experience either the original version (Group A) or the changed version (Group B) of the website/app/feature.
 Analysis: After enough data is collected, you’ll analyze the results based on the metrics you’ve chosen.
 Conclusion: Based on the results, you’ll decide whether to implement the change for all users, iterate and run another test, or reject the change.
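One common way to implement the allocation step is deterministic hash-based bucketing, so a returning user always lands in the same group. A sketch in plain Python; the experiment name and 50/50 split are hypothetical:

```python
import hashlib

def assign_group(user_id, experiment, treatment_pct=50):
    """Deterministically bucket a user: the same user always gets the same group.

    Hashing the user_id together with the experiment name keeps
    assignments independent across concurrent experiments.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return "treatment" if bucket < treatment_pct else "control"

# Simulate 10,000 users entering a hypothetical experiment.
groups = [assign_group(f"user_{i}", "new_checkout_flow") for i in range(10_000)]
share = groups.count("treatment") / len(groups)
print(round(share, 2))  # close to the intended 50/50 split
```

Because assignment depends only on the hash, no per-user state needs to be stored, and the split stays stable across sessions and devices that share the same user_id.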
Switchback Experiments
 Development and Popularity:
 Used by companies like Uber, Lyft, DoorDash, Amazon, and Rover, highlighting switchback tests’ widespread adoption and effectiveness.
 Why Switchback Tests?
 Designed to measure impact in scenarios with strong network effects and user interdependence.
 Particularly relevant in social media apps and two-sided marketplaces, where traditional A/B testing can be misleading.
 Example: In ridesharing services such as Lyft, all riders in an area share the same supply of drivers; a test that increases the probability of riders booking (e.g., a discount code) also reduces the supply of drivers available to the control group, demonstrating the interconnectedness in such marketplaces.
 Key Features of Switchback Tests:
 Mechanism: Involves alternating between test and control treatments over time, ensuring uniform treatment across a network at any given moment.
 Advantages: Effective in situations with immediate test effects and strong interference, overcoming limitations of traditional A/B tests.
 Implementation Details:
 Time Interval Determination: Essential for successful experiments, balancing the need to capture effects with the requirement for ample test and control samples.
 Schedule Creation: Involves assigning test and control intervals, with strategies ranging from simple randomization to complex, product-specific algorithms.
 Segmentation and Clustering: Enhances sampling and data point collection by dividing users into independent clusters, like cities.
 Burn-In and Burn-Out Periods: Excludes data near switching boundaries to prevent contamination between test and control effects.
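The implementation details above can be sketched as a simple schedule generator; the interval length, burn-in duration, and random assignment policy below are all illustrative choices:

```python
import random

def switchback_schedule(hours, interval_hours=2, burn_in_minutes=15, seed=3):
    """Randomly assign each time interval to test or control, and mark the
    first burn_in_minutes after every switch as excluded from analysis."""
    rng = random.Random(seed)
    schedule = []
    prev = None
    for start in range(0, hours, interval_hours):
        arm = rng.choice(["test", "control"])
        # Data just after a switch is contaminated by the previous arm's effects.
        burn_in = prev is not None and arm != prev
        schedule.append({
            "start_hour": start,
            "end_hour": start + interval_hours,
            "arm": arm,
            "exclude_first_minutes": burn_in_minutes if burn_in else 0,
        })
        prev = arm
    return schedule

# Hypothetical two-day experiment within one city cluster.
for slot in switchback_schedule(hours=48)[:4]:
    print(slot)
```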
 Analysis Techniques:
 Challenges with Traditional Methods: The interdependence of users makes standard statistical tests like t-tests unsuitable.
 Preferred Approaches: Regression analysis for specific metrics and products, and bootstrapping analysis for a more generalizable solution.
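A bootstrap analysis for a switchback test can be sketched by resampling whole time intervals within each arm, which respects the dependence of observations inside an interval; the per-interval delivery times are invented:

```python
import random

def bootstrap_diff_ci(test_intervals, control_intervals, n_boot=5000, seed=11):
    """95% bootstrap CI for the difference in mean metric between arms,
    resampling whole time intervals rather than individual users."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        t = [rng.choice(test_intervals) for _ in test_intervals]
        c = [rng.choice(control_intervals) for _ in control_intervals]
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    return diffs[int(0.025 * n_boot)], diffs[int(0.975 * n_boot)]

# Hypothetical per-interval mean delivery times (minutes) from a switchback test.
test = [28.1, 27.5, 29.0, 26.8, 27.9, 28.4, 27.2, 28.8]
control = [30.2, 29.8, 31.1, 30.5, 29.9, 30.8, 31.3, 30.0]
lo, hi = bootstrap_diff_ci(test, control)
print(round(lo, 2), round(hi, 2))  # interval entirely below zero → faster deliveries under test
```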
 Practical Considerations and Limitations:
 Suitability for Short-Term Effects: Optimal for assessing transactional behaviors within single sessions, not for long-term impacts.
 Importance of Setup: Effective implementation relies on selecting appropriate time windows and clustering parameters, necessitating deep domain knowledge.
 In short, switchback experiments play a critical role in scenarios where traditional A/B testing falls short due to network effects, provided their time windows, clustering, and analysis methods are chosen with care.
 For more, refer to Switchback experiments: Overview and considerations.
Conclusion
 Understanding the intricacies of statistical and experimental concepts is paramount in research. Whether one is conducting A/B tests, clinical trials, or social science experiments, having a solid grasp of hypothesis testing, the importance of controlling confounders, and the implications of biases can significantly affect the outcomes. These tools, combined with an appreciation for both the magnitude (effect size) and the significance (p-value) of results, ensure that research findings are both scientifically robust and practically relevant. Furthermore, recognizing the balance between internal and external validity can guide research design, allowing for results that are both causally accurate and generalizable. As research continues to evolve with technological advancements, maintaining a foundational understanding of these concepts is essential for producing reliable, actionable, and impactful insights.
From Statistical to Causal Learning
 In From Statistical to Causal Learning, Schölkopf and von Kügelgen describe basic ideas underlying research to build and understand artificially intelligent systems: from symbolic approaches via statistical learning to interventional models relying on concepts of causality. Some of the hard open problems of machine learning and AI are intrinsically related to causality, and progress may require advances in our understanding of how to model and infer causality from data.
References
 Trustworthy Online Controlled Experiments: A Practical Guide to A/B Testing
 HBR: The Surprising Power of Online Experiments
 HBR: Avoid the Pitfalls of A/B Testing
 A Crash Course on Online Experiments
 A/B/n Testing
 Goals Gone Wild: The Systematic Side Effects of Over-Prescribing Goal Setting and Goodhart’s Law
 Reliance on Metrics is a Fundamental Challenge for AI
 Statistical Power Analysis
 Type III Error
 Adobe Target: Sample Size Calculator
 Sample Size Calculation with Both Confidence Level and Power
 The Significance of Size & Power Calculation in A/B Tests
 UNICEF’s Report on Impact Evaluation Using Randomized Controlled Trials
 How long should you run an A/B Test?
 The essential guide to Sample Ratio Mismatch for your A/B tests
 Metaflow: quick and easy real-life data science and ML projects
Further Reading
 Netflix:
 Decision Making at Netflix
 What is an A/B Test?
 Interpreting A/B test results: false positives and statistical significance
 Interpreting A/B test results: false negatives and power
 Building confidence in a decision
 Experimentation is a major focus of Data Science across Netflix
 Computational Causal Inference at Netflix
 It’s All A/Bout Testing: The Netflix Experimentation Platform
 A Survey of Causal Inference Applications at Netflix
 A/B Testing and Beyond: Improving the Netflix Streaming Experience with Experimentation and Data Science
 Quasi Experimentation at Netflix
 LinkedIn: Experimentation tag on engineering blog; notably, building inclusive products through A/B testing and “no production release at the company happens without experimentation”
 Google: Decision Intelligence with Cassie Kozyrkov
 Uber: “Experimentation is at the core of how Uber improves the customer experience”
 Airbnb: Experimentation tag on engineering blog
 Intuit: Meet Wasabi, an Open Source A/B Testing Platform
 Microsoft: Online Experimentation at Microsoft
 Disney: Universal Holdout Groups at Disney Streaming
 Sudeep Das & Aish Fenton at Netflix on Causal Inference