Interview Prep

AI A/B Testing Analyst Interview Questions

50 expert questions ranging from beginner fundamentals to advanced AI workflow scenarios. Each question comes with a hint describing what a well-structured answer covers.

Beginner: 5 | Intermediate: 10 | Advanced: 10 | Scenario-Based: 10 | AI Workflow & Tools: 10 | Behavioral: 5

Beginner

5 questions
What a great answer covers:

A great answer distinguishes p-values from effect sizes and explains why a statistically significant result with a tiny effect may not justify a business decision.
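
To make the gap between statistical and practical significance concrete, here is a minimal sketch of a pooled two-proportion z-test. The function name and the traffic numbers are hypothetical, chosen only to show a very large sample producing a small p-value for a lift of 0.1 percentage points.

```python
import math

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B vs. A."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value

# Hypothetical data: one million users per arm, 10.0% vs. 10.1% conversion.
lift, z, p_value = two_proportion_ztest(100_000, 1_000_000, 101_000, 1_000_000)
print(f"lift={lift:.4f}, z={z:.2f}, p={p_value:.4f}")
```

The p-value falls below 0.05, yet whether a 0.1-point lift pays for the change is a business question the test cannot answer.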

What a great answer covers:

The answer should cover the counterfactual (what would have happened without the change) and how the control isolates the treatment effect.

What a great answer covers:

A strong answer clarifies that a p-value is the probability of observing data at least as extreme as the result, assuming the null hypothesis is true. It is not the probability that the null is true.

What a great answer covers:

The answer should address confounding variables, selection bias, and how randomization ensures groups are comparable on both observed and unobserved characteristics.

What a great answer covers:

A good answer connects sample size to statistical power (the ability to detect a true effect) and explains that underpowered tests risk false negatives.
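
The link between sample size and power can be sketched with the standard normal-approximation formula for comparing two proportions. The function name, the default alpha and power, and the example rates below are illustrative assumptions, not values from any particular experiment.

```python
import math
from statistics import NormalDist

def sample_size_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate users needed per arm for a two-sided two-proportion test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    effect = p_treatment - p_control
    return math.ceil((z_alpha + z_beta) ** 2 * variance / effect ** 2)

# Hypothetical baseline of 10%: halving the detectable lift
# roughly quadruples the required sample.
print(sample_size_per_arm(0.10, 0.12))
print(sample_size_per_arm(0.10, 0.11))
```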

Intermediate

10 questions
What a great answer covers:

The answer should discuss multi-metric trade-offs, guardrail metrics, cost-benefit analysis, and potentially a composite utility score that weights both engagement gains and cost increases.

What a great answer covers:

A strong answer covers Bonferroni correction, false discovery rate (FDR) control via Benjamini-Hochberg, or multi-armed bandit approaches as alternatives to naive pairwise testing.
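
As a rough illustration of how the two corrections differ, the sketch below applies Bonferroni and the Benjamini-Hochberg step-up procedure to the same list of p-values. The p-values are made up for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject hypotheses whose p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank
    with p_(k) <= (k / m) * alpha."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values from eight variant-vs-control comparisons.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

On these numbers Bonferroni rejects one hypothesis and Benjamini-Hochberg rejects two, reflecting that FDR control is less conservative than family-wise error control.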

What a great answer covers:

The answer should cover defining quality metrics (accuracy, relevance, hallucination rate), human evaluation or LLM-as-judge approaches, sample sizing, and the challenge of non-deterministic outputs.

What a great answer covers:

A great answer explains how initial user excitement with AI features can inflate short-term metrics, and discusses longer experiment durations or cohort-based analysis to detect it.

What a great answer covers:

The answer should cover prior incorporation, posterior distributions, credible intervals vs. confidence intervals, and practical considerations like sequential peeking and decision speed.
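
A minimal Beta-Binomial sketch shows what a posterior and a credible interval look like in practice. It assumes a uniform Beta(1, 1) prior on each conversion rate and uses Monte Carlo draws; the function name and the counts are hypothetical.

```python
import random

def posterior_summary(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Return P(rate_B > rate_A) and a 95% credible interval for the difference."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(draws):
        # Posterior under an assumed Beta(1, 1) prior: Beta(1 + successes, 1 + failures).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        diffs.append(rate_b - rate_a)
    diffs.sort()
    prob_b_better = sum(d > 0 for d in diffs) / draws
    interval = (diffs[int(0.025 * draws)], diffs[int(0.975 * draws) - 1])
    return prob_b_better, interval

# Hypothetical counts: 100/1000 conversions in control, 150/1000 in treatment.
prob, (low, high) = posterior_summary(100, 1000, 150, 1000)
print(f"P(B > A) = {prob:.3f}, 95% credible interval = ({low:.3f}, {high:.3f})")
```

Unlike a confidence interval, the credible interval is a direct probability statement about the parameter, conditional on the prior and the model.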

What a great answer covers:

A strong answer discusses running multiple evaluations per input (sampling), aggregating scores, computing variance, and using larger sample sizes or fixed random seeds for reproducibility.

What a great answer covers:

The answer should define guardrails as metrics that must not degrade beyond a threshold, with AI-specific examples like hallucination rate, toxicity detection scores, or P95 latency.

What a great answer covers:

A great answer uses a concrete example (e.g., overall conversion looks higher for treatment, but reverses when segmented by user tenure) and discusses how to detect and prevent it through proper stratification.
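
The reversal is easiest to see with numbers. The counts below are invented for illustration: treatment looks better overall but worse within each tenure segment, because its traffic is skewed toward tenured users.

```python
# Hypothetical (conversions, users) by variant and tenure segment.
data = {
    "control": {"new": (90, 900), "tenured": (30, 100)},
    "treatment": {"new": (8, 100), "tenured": (243, 900)},
}

def overall_rate(segments):
    """Pooled conversion rate across all segments of one variant."""
    conversions = sum(c for c, _ in segments.values())
    users = sum(n for _, n in segments.values())
    return conversions / users

def segment_rate(variant, segment):
    conversions, users = data[variant][segment]
    return conversions / users

for variant, segments in data.items():
    print(variant, "overall:", round(overall_rate(segments), 3))
    for segment in segments:
        print("  ", segment, round(segment_rate(variant, segment), 3))
```

A segment mix this uneven between arms would itself be a warning sign in a randomized test, which is why checking assignment balance per stratum is part of detecting the problem.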

What a great answer covers:

The answer should explain that intention-to-treat (ITT) analysis preserves randomization by analyzing all users as assigned, while per-protocol analysis includes only compliant users, which can introduce selection bias. ITT is generally preferred.

What a great answer covers:

A strong answer structures metrics into primary (e.g., search success rate), secondary (e.g., click-through rate, dwell time), and guardrail (e.g., latency, cost, safety flags), explaining how each level informs the decision.

Advanced

10 questions
What a great answer covers:

A comprehensive answer covers phased rollout, quality rubrics (human + automated), cost-quality trade-off modeling, guardrail metrics for safety, user satisfaction surveys, and a decision matrix for go/no-go thresholds.

What a great answer covers:

The answer should cover Thompson Sampling or UCB algorithms, exploration-exploitation trade-offs, reduced opportunity cost vs. A/B testing, convergence criteria, and when bandits are inappropriate (e.g., when you need clean causal estimates).
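
A small simulation can illustrate the exploration-exploitation behavior of Thompson Sampling for Bernoulli rewards. This is a sketch under stated assumptions: Beta(1, 1) priors, hypothetical true conversion rates, and a fixed seed. It is not a production allocation service.

```python
import random

def thompson_sampling(true_rates, rounds=3000, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible rate per arm from its posterior, play the best draw.
        samples = [rng.betavariate(1 + s, 1 + f) for s, f in zip(successes, failures)]
        arm = samples.index(max(samples))
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

# Hypothetical true conversion rates for three variants.
pulls = thompson_sampling([0.05, 0.10, 0.30])
print(pulls)
```

Traffic concentrates on the strongest arm as evidence accumulates, which lowers opportunity cost but leaves the weaker arms with few observations and therefore wide effect estimates.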

What a great answer covers:

A strong answer discusses cluster-randomized experiments, the stable unit treatment value assumption (SUTVA), graph-based interference models, or geographic randomization as mitigation strategies.

What a great answer covers:

The answer should cover rubric design, inter-rater reliability (Cohen's kappa), calibration against human judgments, position bias and verbosity bias in LLM judges, and strategies like human spot-checks and adversarial testing.
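
Cohen's kappa corrects raw agreement for the agreement expected by chance. The sketch below computes it for two raters, for example an LLM judge and a human reviewer; the labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical pass/fail labels (1 = pass) from a human and an LLM judge.
human = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
judge = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(round(cohens_kappa(human, judge), 3))
```

Here raw agreement is 80%, but kappa is 0.6 because half of that agreement would be expected by chance.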

What a great answer covers:

A great answer discusses treatment contamination diluting the estimated effect, the need for complier average causal effect (CACE) analysis or instrumental variables, and whether to rerun the experiment.
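
One common way to estimate CACE is the Wald (instrumental variable) ratio: the intention-to-treat effect divided by the difference in treatment uptake between arms. The sketch below assumes the usual IV conditions hold (random assignment, no defiers, and assignment affecting outcomes only through uptake); all numbers are hypothetical.

```python
def complier_average_causal_effect(mean_assigned_treatment, mean_assigned_control,
                                   uptake_treatment, uptake_control):
    """Wald estimator: ITT effect scaled by the uptake gap between arms."""
    itt = mean_assigned_treatment - mean_assigned_control
    uptake_gap = uptake_treatment - uptake_control
    if uptake_gap <= 0:
        raise ValueError("assignment must increase treatment uptake")
    return itt / uptake_gap

# Hypothetical contamination: 10% of control users saw the feature,
# and only 90% of treatment users did.
cace = complier_average_causal_effect(0.12, 0.10, 0.90, 0.10)
print(round(cace, 4))
```

The ITT effect here is 0.02, while the estimate for compliers is 0.025, showing how contamination dilutes the measured effect.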

What a great answer covers:

The answer should cover CUPED (Controlled-experiment Using Pre-Experiment Data) variance reduction, metric transformations (log, winsorization), stratified analysis, and using more granular or proximal metrics.
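
The CUPED adjustment subtracts the part of the metric that is predictable from a pre-experiment covariate. A minimal sketch on synthetic data follows; the generated values and the helper name are assumptions for illustration, and in a real analysis theta is usually estimated on the pooled data from both arms.

```python
import random
from statistics import mean, variance

def cuped_adjust(metric, pre_metric):
    """Return metric values adjusted by the pre-experiment covariate."""
    x_bar, y_bar = mean(pre_metric), mean(metric)
    cov = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    # Centering the covariate keeps the adjusted mean equal to the original mean.
    return [y - theta * (x - x_bar) for x, y in zip(pre_metric, metric)]

# Synthetic users whose in-experiment metric tracks their pre-period value.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(2000)]
post = [x + rng.gauss(0, 1) for x in pre]
adjusted = cuped_adjust(post, pre)
print(round(variance(post), 2), round(variance(adjusted), 2))
```

The stronger the correlation between the pre-period covariate and the metric, the larger the variance reduction and the smaller the sample needed for the same power.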

What a great answer covers:

A strong answer addresses holdout group management, survivorship bias, metric drift, user awareness effects, ethical considerations of withholding features, and how to maintain holdout integrity over long periods.

What a great answer covers:

The answer should discuss Bayesian approaches with informative priors, pre-post analysis, crossover designs, sequential testing with alpha spending, and accepting higher uncertainty with decision-theoretic frameworks.

What a great answer covers:

A comprehensive answer covers building an offline evaluation dataset, human preference ranking (Elo or Bradley-Terry model), automated quality metrics, cost-latency benchmarks, and a staged rollout plan starting with shadow mode.
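
For pairwise human preference data, an Elo-style update is one simple way to turn "output A was preferred over output B" judgments into a ranking. The sketch uses the conventional 400-point scale and K = 32; both are common defaults, assumed here rather than prescribed.

```python
def expected_score(rating_a, rating_b):
    """Probability that A is preferred over B under the Elo model."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings; score_a is 1 for an A win, 0 for a loss, 0.5 for a tie."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - exp_a))
    return new_a, new_b

# Two hypothetical models start level; a rater prefers model A's answer.
print(update_elo(1500, 1500, 1.0))
```

Elo depends on the order in which comparisons arrive, so a Bradley-Terry fit over the full comparison set is often preferred when all judgments are available at once.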

What a great answer covers:

A great answer discusses Goodhart's Law, the need for diverse metric portfolios, qualitative user research to validate quantitative signals, adversarial testing, and monitoring for suspicious metric patterns like sudden jumps without feature changes.

Scenario-Based

10 questions
What a great answer covers:

The answer should cover educating the PM on statistical power risks and proposing alternatives such as a smaller initial test with sequential analysis, a soft launch with qualitative feedback, or a more sensitive proxy metric.

What a great answer covers:

A strong answer presents a structured decision framework weighing conversion lift against latency impact on other metrics (bounce rate, abandonment), user segment analysis, and the option to optimize latency before full rollout.

What a great answer covers:

The answer should discuss distributional analysis (not just means), user segmentation to identify who benefits and who is harmed, bimodal distribution detection, and recommendations for personalization rather than one-size-fits-all decisions.

What a great answer covers:

A great answer covers the risks of optional stopping, sequential testing methods that allow early stopping with controlled error rates, and the difference between peeking at results and pre-planned interim analyses.
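
The cost of optional stopping can be shown with an A/A simulation: both arms share the same true rate, so every "significant" result is a false positive. The sketch compares stopping at the first significant interim look against a single test at the planned end. The simulation sizes and the 1.96 threshold are illustrative assumptions.

```python
import math
import random

def is_significant(conv_a, conv_b, n, z_crit=1.96):
    """Pooled two-proportion z-test with n users per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return False
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    return abs(conv_b - conv_a) / n / se > z_crit

def false_positive_rates(sims=400, n_per_arm=1000, look_every=100, rate=0.10, seed=0):
    """Return (rate with peeking, rate with one final test) for A/A experiments."""
    rng = random.Random(seed)
    peeking_hits = fixed_hits = 0
    for _ in range(sims):
        conv_a = conv_b = 0
        flagged = False
        for i in range(1, n_per_arm + 1):
            conv_a += rng.random() < rate
            conv_b += rng.random() < rate
            # An analyst who peeks would stop at the first significant look.
            if i % look_every == 0 and is_significant(conv_a, conv_b, i):
                flagged = True
        peeking_hits += flagged
        fixed_hits += is_significant(conv_a, conv_b, n_per_arm)
    return peeking_hits / sims, fixed_hits / sims

peeking_rate, fixed_rate = false_positive_rates()
print(peeking_rate, fixed_rate)
```

Sequential methods such as alpha spending keep the overall error rate at the nominal level by using stricter thresholds at each pre-planned look.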

What a great answer covers:

The answer should discuss the possibility of misleading AI-generated descriptions creating false expectations, the importance of the full conversion funnel, and how to design experiments with downstream revenue as the primary metric rather than engagement alone.

What a great answer covers:

The answer should cover automated metrics (BLEU, COMET) for all languages, human evaluation for the 4 available, back-translation quality checks, LLM-as-judge as a bridge, and how to prioritize languages by user volume.

What a great answer covers:

A strong answer weighs speed vs. quality trade-offs, segments by developer experience level, discusses downstream costs of bugs, and recommends potential mitigation strategies like enhanced code review tooling alongside the feature.

What a great answer covers:

The answer should cover the risk of platform migration bias, options to complete the experiment on the old platform, running parallel validation on the new platform, and documenting any methodology differences between platforms.

What a great answer covers:

The answer should discuss interrupted time series analysis, difference-in-differences with unaffected user segments as controls, causal impact models, and the limitations of experimentation in non-stationary environments.
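
The core difference-in-differences estimate is simple arithmetic on four group means. The sketch below assumes parallel trends, meaning the unaffected segment would have moved the same way as the affected one without the change; the means are hypothetical.

```python
def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """Change in the treated group minus change in the comparison group."""
    return (treated_post - treated_pre) - (control_post - control_pre)

# Hypothetical conversion rates before and after an external shock.
effect = diff_in_diff(treated_pre=0.20, treated_post=0.26,
                      control_pre=0.18, control_post=0.21)
print(round(effect, 3))
```

The treated segment rose by 6 points, but 3 of those points also appear in the unaffected segment, so only 3 points are attributed to the change.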

What a great answer covers:

A great answer covers shadow mode testing (AI generates recommendations without showing them to users), retrospective evaluation against known outcomes, IRB/ethics review, non-inferiority or equivalence testing to show the AI performs no worse than the current standard, and phased rollouts with opt-in consent.

AI Workflow & Tools

10 questions
What a great answer covers:

The answer should cover setting up tracing runs, tagging variants with metadata, defining evaluation functions, aggregating scores across runs, and comparing distributions between control and treatment groups.

What a great answer covers:

A strong answer describes defining eval criteria (accuracy, tone, completeness), creating sample conversations, running both models, using a judge model or human raters, and computing comparative scores with confidence intervals.

What a great answer covers:

The answer should cover loading metrics, running batch evaluations, handling edge cases, combining automated metrics with human evaluation, and interpreting scores in context.

What a great answer covers:

A great answer covers the end-to-end architecture: Google BigQuery client or dbt for extraction, pandas for transformation, scipy/statsmodels for testing, and a reporting layer (Jupyter, Hex, or automated email).

What a great answer covers:

The answer should cover W&B Tables for logging inputs/outputs, custom metrics tracking, run comparison dashboards, sweep configurations for systematic testing, and artifact versioning for prompt templates.

What a great answer covers:

A strong answer covers SDK initialization, user ID-based assignment, exposure event logging, guardrail metric configuration, and the importance of logging only when a user is actually exposed to the feature.
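
The assignment and exposure-logging ideas can be sketched without any vendor SDK. The code below is a generic illustration, not the API of a specific experimentation platform: the function names, the in-memory log, and the experiment key are all hypothetical.

```python
import hashlib

def assign_variant(user_id, experiment, variants=("control", "treatment")):
    """Deterministically map a user to a variant by hashing experiment and user ID."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

exposure_log = []  # stand-in for an analytics event stream

def render_feature(user_id, experiment, feature_visible):
    """Assign a variant, but log exposure only if the user actually saw the feature."""
    variant = assign_variant(user_id, experiment)
    if feature_visible:
        exposure_log.append(
            {"user_id": user_id, "experiment": experiment, "variant": variant}
        )
    return variant

render_feature("user-1", "ai_summary_v2", feature_visible=True)
render_feature("user-2", "ai_summary_v2", feature_visible=False)
print(exposure_log)
```

Logging at exposure rather than at assignment keeps users who never reached the feature out of the analysis, which avoids diluting the measured effect.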

What a great answer covers:

The answer should describe staging models for raw events, intermediate models for assignment-exposure joins, fact tables for metric computation, and testing for data quality (e.g., no users in both groups).
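
The "no users in both groups" check would normally live in the warehouse as a data test. The Python sketch below shows only the logic, on hypothetical assignment rows.

```python
from collections import defaultdict

def users_in_multiple_groups(assignments):
    """Return the set of user IDs that appear under more than one variant."""
    variants_by_user = defaultdict(set)
    for user_id, variant in assignments:
        variants_by_user[user_id].add(variant)
    return {user for user, variants in variants_by_user.items() if len(variants) > 1}

# Hypothetical (user_id, variant) rows from an assignment table.
rows = [("u1", "control"), ("u2", "treatment"), ("u1", "treatment"), ("u3", "control")]
print(users_in_multiple_groups(rows))
```

Any user returned by this check has a contaminated assignment, and the analysis plan should state how such users are handled (for example, excluded from both arms).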

What a great answer covers:

A great answer covers shadow endpoint configuration, logging both model outputs for the same inputs, building an evaluation pipeline against the logged data, and comparing latency and cost characteristics.

What a great answer covers:

The answer should cover rubric design with clear scoring criteria, calibration against human ratings on a held-out set, measuring inter-rater agreement, testing for known biases (position, verbosity), and periodic human audit loops.

What a great answer covers:

A strong answer covers custom event taxonomy design, super properties for experiment group assignment, funnel analysis for AI interaction loops, and building composite engagement scores that weight AI-specific signals appropriately.

Behavioral

5 questions
What a great answer covers:

A great answer shows intellectual humility, describes how you validated the data, investigated potential issues, and ultimately let the data guide the decision even when it surprised you.

What a great answer covers:

The answer should demonstrate the ability to translate statistical concepts into business language, use clear visualizations, and focus on the decision implications rather than methodological details.

What a great answer covers:

A strong answer shows collaborative problem-solving, willingness to re-examine assumptions, and a commitment to methodological rigor while maintaining positive working relationships.

What a great answer covers:

The answer should cover how you assessed data quality, made reasonable assumptions transparently, used sensitivity analysis to test robustness of conclusions, and communicated uncertainty clearly to stakeholders.

What a great answer covers:

A great answer demonstrates genuine intellectual curiosity, specific sources (research papers, conferences, communities), and a concrete example of translating new knowledge into practice.