A/B Test Evaluation and Uplift Modelling Pipeline

2024 · SageMaker Pipelines · Snowflake · Tableau

Impact

  • Retention rates increased 35% since the system went live, driven by more effective incentives targeted at the right customer groups
  • Automated causal measurement of retention effect for every campaign, replacing manual post-hoc analysis
  • Runs monthly; any experiment completed in the prior period is automatically included, so results stay current
  • Statistical validation flags per result row: properly randomized, adequately powered, statistically significant, plus a combined validity flag

Business Problem

Running a retention campaign tells you what happened, but not whether the campaign caused it. Without a causal measurement layer, analysts computed uplift manually in spreadsheets, with no standardized way to check randomization, statistical power, or subgroup differences. Experiments were accumulating, but their value remained unquantified.

Solution Design

  1. A monthly schedule triggers the SageMaker Pipeline, pulling the latest A/B test group assignments from Snowflake.
  2. The cohort-level step computes retention difference between test and control using bootstrapped confidence intervals.
  3. The subgroup-level step runs one logistic regression per feature slice per experiment, with the treatment indicator interacted with the subgroup feature to isolate differential effects.
  4. The validation step writes four binary flags per result row: randomization quality, statistical power, significance, and combined validity.
  5. Results are written to an S3 exchange bucket and loaded into Snowflake as cohort and subgroup uplift tables. MLflow logs the run for tracking and reproducibility.
  6. Tableau surfaces the results, filtering on validity flags so analysts only act on statistically trustworthy findings without manual triage.
Figure: Uplift modelling pipeline architecture (architecture_v1.svg)
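
The cohort-level computation in step 2 can be illustrated with a minimal sketch. It assumes a percentile bootstrap over per-customer 0/1 retention outcomes; the function name `bootstrap_uplift_ci`, its parameters, and the toy data are hypothetical and not taken from the pipeline itself.

```python
import random

def bootstrap_uplift_ci(treated, control, n_boot=1000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for the retention difference (treated - control).

    `treated` and `control` are lists of 0/1 retention outcomes.
    Assumption: groups are resampled independently, with replacement.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        t = rng.choices(treated, k=len(treated))
        c = rng.choices(control, k=len(control))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(treated) / len(treated) - sum(control) / len(control)
    return point, lo, hi

# Toy data (illustrative only): 60% retention in test, 50% in control.
treated = [1] * 600 + [0] * 400
control = [1] * 500 + [0] * 500
point, lo, hi = bootstrap_uplift_ci(treated, control)
print(f"uplift={point:.3f}, 95% CI=({lo:.3f}, {hi:.3f})")
```

An interval that excludes zero suggests the retention difference is unlikely to be resampling noise; the validation flags still decide whether the result should be acted on.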

Technical Challenges

Causal validity as a hard constraint. Uplift numbers are only meaningful if the experiment was properly randomized and adequately powered. The pipeline encodes all checks in a validation step and writes flags per row, making it impossible to accidentally treat an underpowered or imbalanced experiment as a real effect.
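
As a rough sketch of how such per-row flags might be derived: the source does not name the specific tests, so this example assumes a sample-ratio-mismatch chi-square check for randomization, a normal-approximation power calculation against a hypothetical minimum detectable effect, and a pooled two-proportion z-test for significance. All names and thresholds are illustrative.

```python
import math
from dataclasses import dataclass
from statistics import NormalDist

@dataclass
class ValidityFlags:
    randomized: bool
    powered: bool
    significant: bool
    valid: bool  # combined flag: all three checks pass

def validate(n_t, x_t, n_c, x_c, expected_share=0.5, mde=0.02,
             alpha=0.05, min_power=0.8, srm_alpha=0.001):
    """n_*: group sizes, x_*: retained counts (t = test, c = control)."""
    nd = NormalDist()
    n = n_t + n_c

    # Randomization: chi-square (1 df) test of observed vs expected split.
    exp_t = n * expected_share
    exp_c = n - exp_t
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c
    srm_p = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    randomized = srm_p >= srm_alpha

    # Power: approximate power to detect an absolute uplift of `mde`.
    p_c = x_c / n_c
    p_alt = min(p_c + mde, 1.0)
    se_alt = math.sqrt(p_c * (1 - p_c) / n_c + p_alt * (1 - p_alt) / n_t)
    z_crit = nd.inv_cdf(1 - alpha / 2)
    power = nd.cdf(mde / se_alt - z_crit) if se_alt > 0 else 0.0
    powered = power >= min_power

    # Significance: pooled two-proportion z-test, two-sided.
    p_t = x_t / n_t
    pooled = (x_t + x_c) / n
    se0 = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    p_value = 2 * (1 - nd.cdf(abs((p_t - p_c) / se0))) if se0 > 0 else 1.0
    significant = p_value < alpha

    return ValidityFlags(randomized, powered, significant,
                         randomized and powered and significant)

print(validate(10000, 6200, 10000, 6000))  # balanced and large
print(validate(200, 124, 200, 120))        # too small to be powered
print(validate(12000, 7400, 8000, 4800))   # split far from 50/50
```

Keeping the combined flag as a strict conjunction means a downstream filter only needs one column to exclude untrustworthy rows.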

Subgroup analysis at scale. One logistic regression runs per subgroup per month per experiment, with the treatment indicator interacted with the subgroup feature. A config-driven design controls feature selection and binning, keeping the regression count manageable as experiments grow across 5 lotteries.
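
The interaction model can be sketched as follows. This is a self-contained illustration, assuming a binary subgroup feature and a plain Newton-Raphson fit written out by hand; a production pipeline would more plausibly call a statistics library, and every name and number here is hypothetical.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fit_logistic(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for b in range(k):
                    hess[a][b] += w * xi[a] * xi[b]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    return beta

def interaction_effect(rows):
    """rows: (treated, in_subgroup, retained) triples of 0/1 values."""
    X = [[1.0, t, s, t * s] for t, s, _ in rows]
    y = [r for _, _, r in rows]
    names = ["intercept", "treatment", "subgroup", "treatment_x_subgroup"]
    return dict(zip(names, fit_logistic(X, y)))

def cell(t, s, n, retained):
    """Build n rows for one treatment/subgroup cell (toy data)."""
    return [(t, s, 1)] * retained + [(t, s, 0)] * (n - retained)

rows = (cell(0, 0, 100, 50) + cell(1, 0, 100, 55)
        + cell(0, 1, 100, 40) + cell(1, 1, 100, 60))
coefs = interaction_effect(rows)
print(coefs)
```

The `treatment_x_subgroup` coefficient is the quantity of interest: a positive value indicates that, on the log-odds scale, the treatment effect is larger inside the subgroup than outside it.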

Post-treatment subgrouping is unsupported by design. Slicing on outcomes that only occur after treatment, such as gift acceptance, introduces selection bias and would invalidate causal claims. The pipeline deliberately supports only pre-treatment feature splits.
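
One way a config-driven design could enforce this rule is to have every subgroup feature declare whether it is measured before treatment, and to reject the config otherwise. The sketch below is an assumption about how such a guard might look; `SubgroupFeature`, `select_subgroup_features`, and the feature names are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubgroupFeature:
    name: str
    pre_treatment: bool   # True if measured before group assignment
    bins: tuple = ()      # optional bin edges for numeric features

def select_subgroup_features(features):
    """Return the features to slice on, refusing post-treatment ones."""
    rejected = [f.name for f in features if not f.pre_treatment]
    if rejected:
        raise ValueError(
            f"post-treatment features cannot be used for subgrouping: {rejected}"
        )
    return list(features)

# Hypothetical config: tenure is known before treatment, gift acceptance is not.
config = [
    SubgroupFeature("tenure_months", pre_treatment=True, bins=(0, 12, 36)),
    SubgroupFeature("gift_accepted", pre_treatment=False),
]

try:
    select_subgroup_features(config)
except ValueError as err:
    print(err)
```

Failing loudly at config time, rather than silently dropping the feature, keeps the restriction visible to whoever adds a new slice.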

Status

  • Running monthly across all 5 lotteries
  • Cohort-level and subgroup-level uplift results written to Snowflake and surfaced in Tableau
  • Dashboard live in preliminary form; deeper filtering and experiment tagging planned
  • Known limitation: some lotteries may require manual overrides for broken control group assignments

Next Steps

  • Richer segmentation and personalization: as uplift signals mature across segments, feed them into the Content Personalization Engine to move from indicative measurement to differentiated campaign content based on what actually drives retention per segment
  • Broader intervention types: experiment with a wider range of incentives including discounts, offers, and communication formats to build a richer evidence base of what works per segment
  • From indicative to prescriptive: currently uplift estimates are indicative and analysts translate them into campaign recommendations for marketers; the goal is to move toward a system that directly recommends the optimal intervention per customer segment without manual mediation
  • Tableau redesign with deep linking per experiment, KPI rollups, and uplift confidence flags
  • Automated validation logging per experiment so flag history is preserved across runs
  • ROI calculator integrating uplift estimates with campaign cost and ticket value
  • Non-randomized design support using inverse propensity weights for experiments where a clean control group was not maintained
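
The planned inverse propensity weighting could take roughly the following shape. This is a minimal sketch only: it assumes propensity scores are already estimated per customer and uses the normalized (Hajek) form of the estimator; the function name and toy data are hypothetical.

```python
def ipw_uplift(treated, retained, propensity):
    """Normalized inverse-propensity-weighted retention difference.

    treated, retained: 0/1 per customer; propensity: P(treated) per customer.
    Assumption: propensities are strictly between 0 and 1.
    """
    w_t = w_c = y_t = y_c = 0.0
    for t, y, e in zip(treated, retained, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity must be strictly between 0 and 1")
        if t:
            w_t += 1.0 / e
            y_t += y / e
        else:
            w_c += 1.0 / (1.0 - e)
            y_c += y / (1.0 - e)
    return y_t / w_t - y_c / w_c

def stratum(e, n_t, r_t, n_c, r_c):
    """Toy stratum: n_t treated (r_t retained), n_c control (r_c retained)."""
    units = [(1, 1, e)] * r_t + [(1, 0, e)] * (n_t - r_t)
    units += [(0, 1, e)] * r_c + [(0, 0, e)] * (n_c - r_c)
    return units

# Two strata with a true uplift of 0.10 each, but unequal treatment rates.
units = stratum(0.8, 80, 40, 20, 8) + stratum(0.2, 20, 18, 80, 64)
t, y, e = zip(*units)

naive = (sum(yi for ti, yi in zip(t, y) if ti) / sum(t)
         - sum(yi for ti, yi in zip(t, y) if not ti) / (len(t) - sum(t)))
print(f"naive={naive:.2f}, ipw={ipw_uplift(t, y, e):.2f}")
```

In this toy example the naive difference is negative because treatment is concentrated in the low-retention stratum, while the weighted estimate recovers the within-stratum uplift.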