Devoured - April 30, 2026
A/B Testing Pitfalls: What Works and What Doesn't with Real Data (5 minute read)

Most A/B test failures stem from broken infrastructure and poor experimentation practices rather than bad product ideas, with issues like data quality bugs and early peeking invalidating results far more often than teams realize.

What: An article examining the four major pitfalls that cause A/B tests to fail in production: Sample Ratio Mismatch from broken randomization, early peeking that inflates false positives from 5% to 25%, insufficient statistical power, and optimizing wrong metrics without guardrails. It covers solutions used by companies like Netflix, Microsoft, and Booking.com, including variance reduction techniques, sequential testing methods, and automated data quality checks.
Why it matters: Teams routinely ship features based on misleading test results because they skip data quality checks, peek at results too early, or optimize vanity metrics that boost short-term engagement while harming long-term retention. The gap between effective and ineffective experimentation isn't statistical sophistication but operational discipline like automated SRM checks and pre-registered metrics.
Takeaway: Implement automated Sample Ratio Mismatch checks before analyzing any test results, predefine stopping rules using sequential testing methods instead of checking p-values daily, and establish guardrail metrics to catch unintended consequences on retention and user satisfaction.
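
The takeaway's first step, an automated SRM check, is essentially a chi-square goodness-of-fit test on the observed assignment counts. A minimal sketch, assuming the counts come from your own experiment logs and using 0.001 as the alert threshold (a common choice, not a figure from the article):

```python
from scipy.stats import chisquare

def check_srm(control_count: int, treatment_count: int,
              expected_ratio: float = 0.5, alpha: float = 0.001) -> bool:
    """Flag a Sample Ratio Mismatch with a chi-square goodness-of-fit test.

    A tiny p-value means the observed split is very unlikely under the
    configured ratio, so randomization or logging is probably broken and
    the test's results should not be analyzed further.
    """
    total = control_count + treatment_count
    expected = [total * expected_ratio, total * (1 - expected_ratio)]
    _, p_value = chisquare([control_count, treatment_count], f_exp=expected)
    if p_value < alpha:
        print(f"SRM detected (p = {p_value:.2e}); do not trust this test")
        return True
    return False

# A 52/48 split on ~100k users is a glaring SRM, not random noise.
check_srm(control_count=52_000, treatment_count=48_000)
```
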
Deep dive
  • Sample Ratio Mismatch (SRM) is a critical early warning sign that randomization is broken, with even small deviations like 52/48 instead of 50/50 indicating data quality issues that invalidate results
  • Microsoft and DoorDash case studies show SRM often reveals logging failures, biased traffic routing, or time-based bucketing bugs that create phantom wins
  • Checking test results daily (peeking) transforms a 5% false positive rate into 25% or higher by running multiple comparisons without statistical adjustment; the simulation sketched after this list shows the effect
  • Sequential testing methods like group sequential tests, always-valid p-values, and anytime-valid confidence sequences allow safe continuous monitoring while preserving Type I error guarantees
  • CUPED (Controlled-experiment Using Pre-Experiment Data) reduces variance by 40-50% by using pre-experiment behavior as a covariate, equivalent to adding 20% more traffic without actually collecting more data
  • The technique works by adjusting each user's metric for their pre-existing behavior, so the test measures the treatment effect rather than pre-existing variance (see the CUPED sketch after this list)
  • Guardrail metrics catch unintended consequences, as in Airbnb's case where a test increased bookings but decreased review ratings; such checks flag about 5 major negative impacts monthly
  • Novelty effects cause users to engage with new features simply because they're new, requiring long-term holdout groups (5-10% of users) to validate whether effects persist beyond initial curiosity
  • Top experimentation teams at Booking.com run 1,000+ concurrent tests with 90% failure rates, measuring success by test velocity and data quality rather than win rate
  • Best practices include pre-registering all metrics before tests start, running postmortems on every launch regardless of outcome, and using centralized platforms that enforce randomization correctness
  • Modern platforms like Optimizely and Statsig automatically run SRM tests with no override option, treating data quality checks as non-negotiable guardrails
  • The cultural challenge is greater than the statistical one: teams must resist the temptation to peek early, ignore warnings, or ship wins without validation
  • CUPED works best for established users with stable metrics; it shouldn't be used for new-user acquisition tests or when pre-period data is unavailable or unstable
  • Companies structure guardrails into three tiers: revenue/engagement (must not decrease), user experience metrics (NPS, load time), and operational metrics (support tickets, errors)
  • Testing volume matters more than win rate because the goal is learning faster than competitors, not maximizing successful launches
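
To make the peeking numbers above concrete, here is a small Monte Carlo sketch: both groups receive identical traffic (an A/A test), yet stopping at the first daily p-value below 0.05 declares far more false "wins" than a single fixed-horizon look. The 14 daily looks and 200 users per day are illustrative assumptions, not figures from the article.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

def run_aa_test(days: int = 14, users_per_day: int = 200, peek: bool = True) -> bool:
    """Simulate an A/A test; return True if it is (falsely) declared significant."""
    control = np.array([])
    treatment = np.array([])
    for _ in range(days):
        control = np.append(control, rng.normal(0.0, 1.0, users_per_day))
        treatment = np.append(treatment, rng.normal(0.0, 1.0, users_per_day))
        if peek and ttest_ind(control, treatment).pvalue < 0.05:
            return True  # stopped early on a noise-driven "win"
    # Fixed-horizon analysis: only the final look counts.
    return ttest_ind(control, treatment).pvalue < 0.05

runs = 1_000
peeking = sum(run_aa_test(peek=True) for _ in range(runs)) / runs
fixed = sum(run_aa_test(peek=False) for _ in range(runs)) / runs
print(f"False positive rate, peeking daily:     {peeking:.1%}")
print(f"False positive rate, single final look: {fixed:.1%}")
```
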
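And a minimal sketch of the CUPED adjustment referenced above, assuming each user has a pre-experiment value of the same metric (for example, sessions in the prior month). The adjusted outcome subtracts the component predictable from pre-experiment behavior, which is what shrinks the variance; the simulated data below is tuned only so the printed reduction lands near the article's 40-50% range.

```python
import numpy as np

def cuped_adjust(y: np.ndarray, x_pre: np.ndarray) -> np.ndarray:
    """Return CUPED-adjusted outcomes: y - theta * (x_pre - mean(x_pre)).

    theta = cov(y, x_pre) / var(x_pre) is the regression slope of the
    outcome on the pre-experiment covariate, estimated on pooled
    control + treatment data so the adjustment stays unbiased.
    """
    theta = np.cov(y, x_pre)[0, 1] / np.var(x_pre, ddof=1)
    return y - theta * (x_pre - x_pre.mean())

rng = np.random.default_rng(1)
x_pre = rng.normal(10, 3, 10_000)            # pre-experiment sessions per user
y = 0.9 * x_pre + rng.normal(0, 3, 10_000)   # in-experiment sessions
y_adj = cuped_adjust(y, x_pre)
print(f"Variance reduction from CUPED: {1 - y_adj.var() / y.var():.0%}")
```
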
Decoder
  • Sample Ratio Mismatch (SRM): When the actual split of users between control and treatment groups deviates from the expected ratio (like 52/48 instead of 50/50), indicating broken randomization or data quality issues
  • CUPED: Controlled-experiment Using Pre-Experiment Data, a variance reduction technique that uses user behavior before the test to reduce noise and shrink confidence intervals by 40-50%
  • Sequential testing: Statistical methods that allow checking test results multiple times without inflating false positive rates, unlike traditional fixed-horizon tests
  • Guardrail metrics: Secondary metrics monitored to catch unintended negative consequences, not optimized for but used as safety nets (like retention, NPS, error rates); a tiered example is sketched after this list
  • p-value peeking: The practice of repeatedly checking statistical significance during a test, which inflates false positives from 5% to 25%+ when done without proper adjustment
  • Novelty effect: Short-term engagement increases that occur because users interact with new features out of curiosity rather than genuine preference
  • Holdout group: A portion of users (typically 5-10%) kept in the control experience after launch to measure whether test effects persist long-term
  • Alpha spending function: A method in group sequential tests that optimally allocates Type I error across multiple interim looks at the data
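
The three-tier guardrail structure from the Deep dive can be encoded as plain configuration plus one check. A hypothetical sketch: the metric names and thresholds below are illustrative, not the article's, and it assumes relative changes are sign-normalized upstream so that negative always means "worse".

```python
from dataclasses import dataclass

@dataclass
class Guardrail:
    metric: str
    tier: str                 # "revenue", "experience", or "operational"
    max_relative_drop: float  # largest tolerated worsening, e.g. 0.01 = 1%

# Illustrative tiers and thresholds; real values are a per-company policy call.
GUARDRAILS = [
    Guardrail("revenue_per_user", "revenue", 0.00),      # must not decrease
    Guardrail("7d_retention", "revenue", 0.00),
    Guardrail("nps", "experience", 0.01),
    Guardrail("page_load_time", "experience", 0.02),
    Guardrail("support_tickets", "operational", 0.05),
]

def evaluate_guardrails(relative_changes: dict[str, float]) -> list[str]:
    """Return violated guardrails, given treatment-vs-control relative changes
    where negative values mean the metric moved in the bad direction."""
    violations = []
    for g in GUARDRAILS:
        change = relative_changes.get(g.metric)
        if change is not None and change < -g.max_relative_drop:
            violations.append(f"[{g.tier}] {g.metric} worsened by {abs(change):.1%}")
    return violations

# Example in the spirit of the Airbnb case: revenue up, satisfaction down.
print(evaluate_guardrails({"revenue_per_user": 0.03, "nps": -0.04}))
```
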
Original article

A/B testing failures are far more often caused by broken infrastructure and poor experimentation practices than by the ideas being tested. Common failures include Sample Ratio Mismatch (SRM) from bad randomization, early peeking that inflates false positives, insufficient statistical power, and optimizing the wrong metrics without guardrails, all of which produce misleading results.