Skill Guide

A/B test design, power analysis, and sequential testing

A/B test design, power analysis, and sequential testing constitute the rigorous statistical methodology for planning, sizing, and monitoring controlled experiments to make data-driven product and business decisions while minimizing false positives and maximizing efficiency.

This skill is the bedrock of evidence-based growth and optimization, directly reducing wasted development resources on ineffective changes and increasing the confidence and ROI of product iterations. Mastering it enables organizations to systematically de-risk innovation and allocate engineering effort to features with proven impact.


How to Learn A/B test design, power analysis, and sequential testing

1. **Core Concepts & Glossary**: Understand the null hypothesis (H0), alternative hypothesis (H1), p-value, significance level (α), power (1-β), Minimum Detectable Effect (MDE), and sample size.
2. **Basic Design Principles**: Learn randomization, control/treatment group definition, and choosing a primary metric.
3. **Using Online Calculators**: Practice running power analyses for simple experiments using free tools like Evan Miller's calculator to internalize the relationship between sample size, MDE, and baseline conversion rate.

1. **Go Beyond Binary Metrics**: Apply testing to continuous metrics (e.g., revenue, time spent) and understand the difference in power analysis.
2. **Common Pitfalls & Corrections**: Learn to identify and mitigate issues like sample ratio mismatch (SRM), multiple testing, and novelty effects.
3. **Tool Proficiency**: Implement power analysis using Python libraries (`statsmodels.stats.power`) or R (`pwr`) for custom scenarios, moving beyond online calculators.
4. **Sizing for Business Impact**: Connect statistical MDE to a minimum viable business impact (e.g., a $50k annual revenue lift).
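For a continuous metric, the required sample size depends on the metric's standard deviation rather than on a baseline rate. The sketch below uses the usual normal-approximation formula for a two-sided, two-sample comparison of means, n = 2σ²(z₁₋α/₂ + z₁₋β)² / δ² per group. The inputs (a $30 standard deviation and a $1 MDE for revenue per user) are hypothetical values chosen only for illustration, and the function name is not from any library.

```python
from math import ceil
from statistics import NormalDist


def sample_size_continuous(sigma, mde, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test on a difference in means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the desired power
    return ceil(2 * (sigma * (z_alpha + z_beta) / mde) ** 2)


# Hypothetical inputs: revenue per user with a $30 standard deviation,
# and a $1 absolute lift as the smallest effect worth detecting.
n_per_group = sample_size_continuous(sigma=30.0, mde=1.0)
print(f"users needed per group: {n_per_group}")
```

Because the MDE enters squared, halving it roughly quadruples the required sample, which is why heavy-tailed metrics such as revenue often need far more traffic than conversion rates.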

1. **Architecting Experimentation Systems**: Design multi-layered or mutually exclusive experiment frameworks to manage concurrent tests at scale.
2. **Strategic MDE Setting**: Align MDE with long-term product strategy, considering opportunity cost and iterative learning cycles.
3. **Mentorship & Process Leadership**: Develop internal guidelines, review experiment designs from peers, and coach teams on choosing the right sequential method (e.g., AGILE vs. Bayesian) based on traffic, velocity, and risk tolerance.

Practice Projects

Beginner

Project

E-commerce Button Color Test Design

Scenario

Your product manager wants to test if changing a 'Buy Now' button from blue (current) to green will increase click-through rate (CTR). The current CTR is 5%. You want to detect at least a 10% relative increase (to 5.5%).

How to Execute

1. **Define Parameters**: Set α = 0.05, Power = 0.8, Baseline Rate = 5%, MDE = 0.5 percentage points absolute (10% relative).
2. **Calculate Sample Size**: Use an online calculator or a Python script to determine the required sample size per variation.
3. **Write Design Doc**: Document the hypothesis, primary metric (CTR), randomization unit (user), sample size, and duration (based on daily traffic).
4. **Simulate Data**: Use a spreadsheet to generate random binomial data for control and treatment with the assumed rates to practice analysis.
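The sample size step can be sketched with the standard library alone. This is a minimal example assuming a two-sided z-test for two proportions with unpooled variance; online calculators and `statsmodels` use slightly different approximations, so their answers may differ by a few percent. The daily traffic figure is a hypothetical placeholder, not part of the scenario.

```python
from math import ceil
from statistics import NormalDist


def sample_size_two_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Per-variation n for a two-sided z-test comparing two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Sum of the Bernoulli variances of the two arms (unpooled).
    variance_sum = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    delta = p_treatment - p_control
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / delta ** 2)


# Scenario values: 5% baseline CTR, 5.5% target CTR.
n_per_variation = sample_size_two_proportions(0.05, 0.055)

# Hypothetical traffic figure, used only to illustrate the duration estimate.
daily_users = 4000
days_needed = ceil(2 * n_per_variation / daily_users)
print(f"per variation: {n_per_variation}, days at {daily_users}/day: {days_needed}")
```

Under these assumptions the test needs roughly 31,000 users per variation, so the duration in the design doc follows directly from how much eligible traffic reaches the button each day.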

Intermediate

Case Study/Exercise

Optimizing a Mobile Game's Tutorial Completion Rate with Sequential Monitoring

Scenario

A mobile game studio is running an A/B test on a simplified tutorial. Traffic is high but the team is impatient for results. They want to stop the test as soon as there's a clear winner without inflating the false positive rate.

How to Execute

1. **Choose a Sequential Method**: Select a method such as Always Valid Inference (AVI) or Bayesian updating with a stopping threshold. Avoid repeatedly checking a standard fixed-horizon test, since peeking at its p-value inflates the false positive rate.
2. **Set Stopping Boundaries**: Calculate the evidence threshold that must be crossed to stop early. For a Bayesian rule this could be a Bayes factor cutoff (e.g., log Bayes factor > 5); for an mSPRT-style always-valid test, it is a mixture likelihood ratio of at least 1/α.
3. **Implement Monitoring Dashboard**: Create a dashboard that updates the test statistic daily against the stopping boundary, not just the p-value.
4. **Run a Pre-mortem**: Document the decision rule upfront: 'We will stop for efficacy if X, stop for futility if Y, and continue otherwise.'
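One way to sketch always-valid monitoring is the mixture sequential probability ratio test (mSPRT) with a normal mixing distribution. The example below is an approximation under stated assumptions: equal allocation, users compared in pairs, a plug-in variance estimate for the binary completion metric, and a hypothetical mixing variance `tau_sq`. The 40% and 45% completion rates are simulated values, not data from the scenario.

```python
from math import exp, log
import random


def msprt_log_likelihood_ratio(mean_diff, n, var_diff, tau_sq):
    """Log mixture likelihood ratio against H0 (no difference), N(0, tau_sq) mixing."""
    denom = var_diff + n * tau_sq
    return 0.5 * log(var_diff / denom) + (n * n * tau_sq * mean_diff ** 2) / (2 * var_diff * denom)


def always_valid_p_values(control, treatment, tau_sq=0.0025):
    """Running always-valid p-value after each pair of users (one per arm)."""
    p_value, p_values = 1.0, []
    sum_c = sum_t = 0
    for n, (c, t) in enumerate(zip(control, treatment), start=1):
        sum_c += c
        sum_t += t
        rate_c, rate_t = sum_c / n, sum_t / n
        # Plug-in variance of one treatment-minus-control difference (an approximation).
        var_diff = rate_c * (1 - rate_c) + rate_t * (1 - rate_t)
        if var_diff > 0:
            log_lr = msprt_log_likelihood_ratio(rate_t - rate_c, n, var_diff, tau_sq)
            # The p-value is the running minimum of 1 / likelihood ratio, so it never increases.
            p_value = min(p_value, exp(-log_lr))
        p_values.append(p_value)
    return p_values


# Simulated data with hypothetical completion rates of 40% (control) and 45% (treatment).
rng = random.Random(42)
control = [rng.random() < 0.40 for _ in range(5000)]
treatment = [rng.random() < 0.45 for _ in range(5000)]

p_values = always_valid_p_values(control, treatment)
stop_at = next((i + 1 for i, p in enumerate(p_values) if p <= 0.05), None)
print(f"first crossing of alpha=0.05 at pair: {stop_at}; final p-value: {p_values[-1]:.4f}")
```

The dashboard in step 3 would plot this running p-value (or the likelihood ratio against 1/α) rather than a fixed-horizon p-value. In practice a minimum sample size before the first look is a sensible guard, since the plug-in variance is noisy for very small n.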

Advanced

Project

Implementing a Multi-Armed Bandit (MAB) for Personalized News Feed Ranking

Scenario

You are the lead data scientist for a content platform. The goal is to dynamically allocate more traffic to better-performing news feed ranking algorithms in real-time, rather than waiting for a classical A/B test to conclude. This requires balancing exploration (learning) and exploitation (showing the best).

How to Execute

1. **Problem Formulation**: Frame ranking as a contextual bandit problem where context = user features, arms = ranking models, reward = engagement.
2. **Select Algorithm**: Choose Thompson Sampling (Bayesian) or Epsilon-Greedy for simplicity and effectiveness.
3. **Design Reward & Logging**: Define a clear reward signal (e.g., dwell time > 30s) and ensure full logging of contexts, chosen arms, and rewards for offline policy evaluation.
4. **Build the System**: Develop a service that receives user context, samples from the posterior (Thompson) or selects an arm (Epsilon-Greedy), serves the result, and updates the model parameters asynchronously.
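The select-and-update loop can be illustrated with a deliberately simplified, non-contextual Beta-Bernoulli version of Thompson Sampling. It ignores user features, which a real contextual bandit would condition on, and treats the reward as binary (for example, dwell time above 30 seconds). The class name, arm names, and true reward rates are hypothetical.

```python
import random


class BetaBernoulliThompson:
    """Thompson Sampling with an independent Beta posterior per arm."""

    def __init__(self, arms, seed=None):
        self.rng = random.Random(seed)
        # One [alpha, beta] pair per arm, starting from a uniform Beta(1, 1) prior.
        self.posteriors = {arm: [1.0, 1.0] for arm in arms}

    def select_arm(self):
        # Draw one plausible reward rate per arm and play the arm with the highest draw.
        samples = {arm: self.rng.betavariate(a, b) for arm, (a, b) in self.posteriors.items()}
        return max(samples, key=samples.get)

    def update(self, arm, reward):
        # A success increments alpha; a failure increments beta.
        if reward:
            self.posteriors[arm][0] += 1
        else:
            self.posteriors[arm][1] += 1


# Simulated environment with hypothetical true engagement rates.
true_rates = {"ranker_a": 0.10, "ranker_b": 0.13}
bandit = BetaBernoulliThompson(true_rates, seed=0)
env = random.Random(1)
pulls = {arm: 0 for arm in true_rates}

for _ in range(5000):
    arm = bandit.select_arm()
    reward = env.random() < true_rates[arm]
    bandit.update(arm, reward)
    pulls[arm] += 1

print(pulls)
```

In a production service the `update` call would typically run asynchronously from logged rewards, as step 4 describes, and every (context, arm, reward) tuple would be logged for offline policy evaluation.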

Tools & Frameworks

Statistical Software & Libraries

- Python: `statsmodels.stats.power`, `scipy.stats`
- R: `pwr`, `gsDesign` (for group sequential)
- Online Calculators: Evan Miller's, Optimizely's Stats Engine

Use these for conducting power analysis, sample size calculation, and implementing frequentist or sequential test boundaries. `gsDesign` is specifically for designing group sequential tests with efficacy and futility stopping rules.

Experimentation Platforms

- LaunchDarkly (feature flags & targeting)
- Optimizely (web/mobile experimentation)
- Internal experimentation platforms at large tech companies (e.g., Google's overlapping experiment infrastructure, Microsoft's ExP)

These platforms handle randomization, exposure logging, metric computation, and often provide built-in statistical analysis (frequentist or Bayesian). Essential for running tests at scale with proper guardrails.

Mental Models & Methodologies

- Hypothesis-Driven Development
- The 'Infinite Pipeline' mindset for experimentation velocity
- Causal Inference frameworks (e.g., potential outcomes) for thinking beyond simple A/B tests

These are the conceptual frameworks for prioritizing what to test, managing an experimentation portfolio, and understanding the deeper causal questions your tests can and cannot answer.

Interview Questions

Answer Strategy

The interviewer is testing your ability to translate business constraints into statistical and design trade-offs. The correct answer involves proposing a negotiation, not just accepting the slow timeline. **Sample Answer**: 'The 12-week timeline is likely unacceptable. I would initiate a trade-off discussion: (1) **Increase MDE**: Could we accept a 10% relative lift (to 2.2%) instead? This would cut the sample size to ~150k, finishing in 3 weeks. (2) **Choose a Better Metric**: Is there a more sensitive upstream metric (e.g., click-to-start-checkout) that has a higher baseline rate and would need fewer users? (3) **Use a Sequential Method**: Implement a Bayesian approach with a stopping rule to potentially conclude earlier if we see strong evidence. I'd present these options with the statistical trade-offs to the PM and engineering lead.'
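The MDE trade-off in the sample answer can be made concrete with a short calculation. This sketch assumes a 2% baseline (inferred from "10% relative lift (to 2.2%)"), α = 0.05, 80% power, and a two-sided z-test for proportions; the original MDE and the traffic level behind the 12-week estimate are not stated, so the other relative lifts shown are illustrative only.

```python
from math import ceil
from statistics import NormalDist


def users_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Per-arm n for a two-sided z-test on proportions, given a relative lift."""
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    target = baseline * (1 + relative_mde)
    variance_sum = baseline * (1 - baseline) + target * (1 - target)
    return ceil(z_total ** 2 * variance_sum / (target - baseline) ** 2)


baseline = 0.02  # assumed from the 2.2% target in the sample answer
for relative_mde in (0.05, 0.10, 0.20):
    per_arm = users_per_arm(baseline, relative_mde)
    print(f"relative MDE {relative_mde:.0%}: {per_arm} per arm, {2 * per_arm} total")
```

Under these assumptions a 10% relative MDE needs on the order of 160k users in total, the same order as the ~150k quoted, and doubling the MDE cuts the requirement to roughly a quarter.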

Answer Strategy

This question probes your rigor and understanding of common pitfalls. You must demonstrate a checklist mentality beyond the p-value. **Sample Answer**: 'First, I check for **Sample Ratio Mismatch (SRM)** to ensure the randomization held and we didn't lose users differentially. Second, I examine the **trend over time**; a lift that appears abruptly or decays suggests a novelty or primacy effect, not a lasting change. Third, I verify the **primary metric definition** and data pipeline integrity: was there any logging error or change in metric calculation during the test? Only after these checks would I trust the result.'
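The SRM check in the sample answer is usually a chi-square goodness-of-fit test on the assignment counts. The sketch below handles the two-arm case with the standard library, using the identity that a chi-square variable with one degree of freedom is a squared standard normal. The user counts and the 0.001 alert threshold are hypothetical choices for illustration.

```python
from math import erfc, sqrt


def srm_p_value(n_control, n_treatment, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm sample ratio check."""
    total = n_control + n_treatment
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_treatment - expected_treatment) ** 2 / expected_treatment)
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return erfc(sqrt(chi2 / 2))


# Hypothetical counts from an experiment designed as a 50/50 split.
p = srm_p_value(50_000, 51_200)
print(f"SRM p-value: {p:.6f}")
if p < 0.001:  # a strict threshold, since SRM checks are often run on every experiment
    print("Possible sample ratio mismatch: investigate before reading the results.")
```

A very small p-value here says the split itself is unlikely under correct randomization, so the metric comparison should not be trusted until the cause (for example, differential logging loss or redirect failures) is found.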