Skill Guide

A/B testing and causal inference fundamentals

A/B testing is a controlled experiment for comparing two or more variants to determine which performs better on a key metric, while causal inference is the statistical framework for establishing that one variable directly causes a change in another, moving beyond mere correlation.

Organizations leverage this skill to make data-driven product, marketing, and operational decisions that directly increase revenue, retention, and efficiency. It transforms guesswork into rigorous, evidence-based strategy, providing a competitive advantage and optimizing resource allocation.

1 Careers

1 Categories

9.0 Avg Demand

15% Avg AI Risk

How to Learn A/B testing and causal inference fundamentals

Focus on: 1. Understanding the core experimental logic: randomization, control vs. treatment groups, and key metrics (KPIs). 2. Learning the basic statistical tests (two-sample t-test for means, two-proportion z-test for conversion rates) and what p-values and confidence intervals mean in a business context. 3. Studying one real-world A/B test case study from a company like Netflix or Booking.com.

Move to practice by: 1. Designing a complete experiment on a simulated dataset, including defining the hypothesis, selecting the randomization unit (e.g., user ID vs. session), and calculating required sample size using power analysis. 2. Recognizing and avoiding pitfalls like selection bias, novelty effects, and the multiple comparisons problem. 3. Implementing a basic difference-in-differences (DiD) analysis for observational data where a pure A/B test isn't feasible.
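The sample-size step can be sketched with the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_group`, the 5% baseline conversion rate, and the 1-percentage-point MDE are illustrative assumptions, not values from this guide.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_group(p_base, mde_abs, alpha=0.05, power=0.8):
    """Approximate users needed per group for a two-sided two-proportion test.

    p_base  : baseline conversion rate in the control group (assumed known)
    mde_abs : minimum detectable effect as an absolute difference in rates
    """
    p_alt = p_base + mde_abs
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    p_bar = (p_base + p_alt) / 2                   # average rate under H0
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p_base * (1 - p_base) + p_alt * (1 - p_alt))
    ) ** 2
    return ceil(numerator / mde_abs ** 2)


# Hypothetical inputs: 5% baseline conversion, detect a lift to 6%.
n_per_group = sample_size_per_group(p_base=0.05, mde_abs=0.01)
# Halving the MDE roughly quadruples the required sample.
n_small_effect = sample_size_per_group(p_base=0.05, mde_abs=0.005)
print(n_per_group, n_small_effect)
```

Because the required sample grows with the inverse square of the MDE, the choice of MDE usually drives test duration more than the choice of alpha or power.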

Master the domain by: 1. Architecting complex experiment stacks (e.g., multi-armed bandits, interleaving experiments) and managing experiment velocity at scale. 2. Applying advanced causal inference methods (e.g., instrumental variables, regression discontinuity, synthetic controls) to measure impact from complex business changes. 3. Aligning experimentation programs with executive strategy and mentoring teams on causal thinking and statistical literacy.

Practice Projects

Beginner

Project

Simulated E-Commerce Checkout Button Test

Scenario

You have a dataset from an e-commerce site. The team wants to test if changing the checkout button color from green (control) to orange (treatment) increases conversion rate.

How to Execute

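1. Use Python (pandas) or R to load the simulated data. 2. Define your null (no difference) and alternative (orange is better) hypotheses. 3. Perform a two-proportion z-test (or an equivalent chi-squared test) on the conversion rates between groups; conversion is a binary outcome, so a test for proportions fits better than a t-test for means. 4. Calculate the p-value and 95% confidence interval for the difference in proportions, and interpret whether the lift is large enough to matter for the business.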
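For a binary outcome like conversion, the comparison in step 3 is commonly run as a two-proportion z-test. The sketch below uses only the standard library and assumes the data has already been aggregated to conversion counts per group. The function name and the counts are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """One-sided z-test (H1: treatment > control) plus a two-sided CI for the lift."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error: valid under the null of no difference.
    p_pool = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 1 - NormalDist().cdf(z)

    # Unpooled standard error for the confidence interval of the difference.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se, diff + z_crit * se)
    return diff, z, p_value, ci


# Hypothetical counts: green (control) vs. orange (treatment).
diff, z, p_value, ci = two_proportion_ztest(conv_c=500, n_c=10_000,
                                            conv_t=560, n_t=10_000)
print(f"lift={diff:.4f}  z={z:.2f}  p={p_value:.4f}  "
      f"95% CI=({ci[0]:.4f}, {ci[1]:.4f})")
```

The p-value here is one-sided, matching the "orange is better" alternative, while the interval is two-sided, so the two can disagree near the significance boundary. Decide which one governs the launch decision before looking at the data.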

Intermediate

Case Study/Exercise

Pricing Experiment Design & Analysis

Scenario

A SaaS company wants to test a new pricing page (with a highlighted 'Pro' plan) against the current page. The randomization unit is the user account. You must design the test, analyze results for revenue per account (not just conversion), and account for users who saw both pages due to bugs.

How to Execute

1. Write a detailed experiment design document including: primary metric (revenue per account), secondary metrics (signup rate, churn), guardrail metrics (support tickets). 2. Calculate the minimum detectable effect (MDE) and sample size for a 2-week test. 3. Plan the analysis strategy: use a t-test on revenue per account (Welch's unequal-variance version is a common choice for skewed revenue data), run a sanity check for Sample Ratio Mismatch (SRM), and handle the 'cross-over' contamination by analyzing accounts according to their original assignment (intention-to-treat), with a sensitivity analysis that excludes accounts exposed to both pages. 4. Draft the experiment launch email and pre-register your analysis plan.
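The SRM sanity check in step 3 is typically a chi-squared goodness-of-fit test of observed group sizes against the planned split. A minimal sketch follows, assuming a 50/50 design and a strict alert threshold of 0.001. Both are assumptions, as are the function name and the account counts.

```python
from math import erfc, sqrt


def srm_check(n_control, n_treatment, expected_ratio=0.5, threshold=0.001):
    """Chi-squared test (1 degree of freedom) for Sample Ratio Mismatch.

    expected_ratio is the planned share of traffic in control.
    Returns (chi2, p_value, flagged).
    """
    total = n_control + n_treatment
    exp_control = total * expected_ratio
    exp_treatment = total * (1 - expected_ratio)
    chi2 = ((n_control - exp_control) ** 2 / exp_control
            + (n_treatment - exp_treatment) ** 2 / exp_treatment)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# Hypothetical account counts per arm.
_, p_skewed, skewed_flag = srm_check(50_000, 51_200)
_, p_balanced, balanced_flag = srm_check(50_000, 50_100)
print(f"skewed split:   p={p_skewed:.5f}  flagged={skewed_flag}")
print(f"balanced split: p={p_balanced:.5f}  flagged={balanced_flag}")
```

A flagged SRM usually points to a bug in assignment or logging, so treat the metric results as unreliable until the cause is found.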

Advanced

Case Study/Exercise

Causal Impact of a Major Product Launch

Scenario

The company launched a major new feature in Q2 that rolled out progressively by region, not via an A/B test. Leadership wants to quantify its causal impact on monthly active users (MAU) and revenue.

How to Execute

1. Gather time-series data for MAU and revenue for treated regions and a pool of potential control regions. 2. Use a Synthetic Control Method: construct a weighted combination of control regions that mimics the pre-launch trends of the treated regions. 3. Implement the method, for example with R's 'Synth' package; CausalImpact (an R package with community Python ports) offers a related Bayesian structural time-series approach rather than a classical synthetic control. 4. Conduct robustness checks (e.g., placebo tests on other regions) and present the estimated lift with uncertainty intervals to stakeholders.
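The core idea of step 2 can be sketched in plain Python: find non-negative weights that sum to one and minimize the pre-launch gap between the treated region and the weighted controls. This is a simplified illustration using projected gradient descent on outcome data only. It is not the algorithm used by the 'Synth' package, which also weights predictor variables. All data and function names below are made up.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            theta = candidate
    return [max(x - theta, 0.0) for x in v]


def predict(controls, weights):
    """Weighted combination of control regions, one value per period."""
    return [sum(x * w for x, w in zip(row, weights)) for row in controls]


def rmse(actual, predicted):
    return (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)) ** 0.5


def fit_weights(treated_pre, controls_pre, n_iter=5000):
    """Projected gradient descent on the pre-period squared error."""
    n_controls = len(controls_pre[0])
    weights = [1.0 / n_controls] * n_controls  # start from equal weights
    # Conservative step size from an upper bound on the curvature.
    step = 1.0 / (2.0 * sum(x * x for row in controls_pre for x in row))
    for _ in range(n_iter):  # fixed count for simplicity; no convergence check
        residuals = [y - p for y, p in zip(treated_pre, predict(controls_pre, weights))]
        gradient = [-2.0 * sum(row[j] * r for row, r in zip(controls_pre, residuals))
                    for j in range(n_controls)]
        weights = project_to_simplex([w - step * g for w, g in zip(weights, gradient)])
    return weights


# Made-up MAU (thousands): rows are periods, columns are three control regions.
controls_pre = [[100, 80, 120], [102, 83, 121], [105, 85, 125],
                [107, 88, 126], [110, 90, 130], [112, 93, 131]]
treated_pre = [90.0, 92.5, 95.0, 97.5, 100.0, 102.5]
controls_post = [[115, 95, 135], [117, 98, 136], [120, 100, 140]]
treated_post = [150.0, 154.0, 158.0]

weights = fit_weights(treated_pre, controls_pre)
pre_fit_rmse = rmse(treated_pre, predict(controls_pre, weights))
equal_rmse = rmse(treated_pre, predict(controls_pre, [1 / 3] * 3))
# Estimated effect per post-launch period: actual minus synthetic counterfactual.
effects = [y - p for y, p in zip(treated_post, predict(controls_post, weights))]
print("weights:", [round(w, 3) for w in weights])
print("pre-period RMSE:", round(pre_fit_rmse, 3), "vs equal weights:", round(equal_rmse, 3))
print("estimated effects:", [round(e, 1) for e in effects])
```

A poor pre-period fit means the synthetic control is not a credible counterfactual, so inspect the pre-launch RMSE before reading anything into the post-launch gap. A placebo test repeats the same fit with each control region treated as if it had launched the feature.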

Tools & Frameworks

Software & Platforms

Python (statsmodels, scipy, CausalImpact), R (lme4, MatchIt, Synth), Optimizely, Google Analytics 4 (Experiments), SQL

Python and R are used for custom statistical analysis and advanced causal methods. Optimizely is a widely used platform for running web and mobile experiments, while GA4 is typically used to monitor experiment metrics, usually through an integration with a dedicated testing tool. SQL is essential for data extraction and manipulation.

Mental Models & Methodologies

Potential Outcomes Framework (Rubin Causal Model), Directed Acyclic Graphs (DAGs), Power Analysis, Difference-in-Differences (DiD), Regression Discontinuity Design (RDD)

The Potential Outcomes Framework is the foundational statistical theory for causal inference. DAGs are used to visually map assumptions and identify confounders. Power Analysis determines the sample size needed to detect a given effect at a chosen significance level and power. DiD and RDD are quasi-experimental methods for settings where randomization is not feasible.
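Of these methods, the basic two-group, two-period DiD estimate reduces to simple arithmetic: the change in the treated group minus the change in the control group. The sketch below uses made-up 30-day retention rates and a hypothetical function name. It assumes the two groups would have followed parallel trends without the intervention, which the data alone cannot confirm.

```python
from statistics import mean


def did_estimate(treated_pre, treated_post, control_pre, control_post):
    """Difference-in-differences on group means.

    The control group's change stands in for what would have happened
    to the treated group without the intervention (parallel trends).
    """
    treated_change = mean(treated_post) - mean(treated_pre)
    control_change = mean(control_post) - mean(control_pre)
    return treated_change - control_change


# Made-up retention rates per cohort, before and after the change.
effect = did_estimate(
    treated_pre=[0.40, 0.42, 0.41],
    treated_post=[0.47, 0.48, 0.49],
    control_pre=[0.38, 0.39, 0.40],
    control_post=[0.40, 0.41, 0.42],
)
print(f"DiD estimate: {effect:.3f}")
```

In practice the same estimate is usually obtained from a regression with a group-by-period interaction term, which also yields standard errors and allows covariates.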

Interview Questions

Answer Strategy

Test for novelty/learning effects. Recommend: 1. Segment results by user tenure to see if new users show a sustained effect while returning users revert. 2. Check if the change required users to learn new behavior that faded. 3. Propose a longer-term holdout test (1-2% of users) to measure long-term impact before full rollout.

Answer Strategy

Demonstrate causal inference beyond A/B testing. Response: 'I would use a quasi-experimental method. First, I'd implement a phased rollout (e.g., by sign-up week or region) to create a natural control group. Then, I'd apply Difference-in-Differences analysis comparing the change in 30-day retention between cohorts exposed to the new vs. old sequence, controlling for secular trends. I'd also use a regression discontinuity design if we have a sharp eligibility cutoff to compare users just above and below the threshold.'