A/B Test Evaluation and Uplift Modelling Pipeline

2024 · SageMaker Pipelines · Snowflake · Tableau

Impact

Retention rates increased 35% since the system went live, driven by more effective incentives targeted at the right customer groups
Automated causal measurement of retention effect for every campaign, replacing manual post-hoc analysis
Runs monthly; any experiment completed in the prior period is automatically included, and results are kept current
Statistical validation flags per result row: properly randomized, adequately powered, statistically significant, plus a combined validity flag

Business Problem

Running a retention campaign tells you what happened, but not whether the campaign caused it. Without a causal measurement layer, analysts computed uplift manually in spreadsheets, with no standardized way to check randomization, statistical power, or subgroup differences. Experiments were accumulating, but their value remained unquantified.

Solution Design

A monthly schedule triggers the SageMaker Pipeline, pulling the latest A/B test group assignments from Snowflake.
The cohort-level step computes retention difference between test and control using bootstrapped confidence intervals.
The subgroup-level step runs one logistic regression per feature slice per experiment, with the treatment indicator interacted with the subgroup feature to isolate differential effects.
The validation step writes four binary flags per result row: randomization quality, statistical power, significance, and combined validity.
Results are written to an S3 exchange bucket and loaded into Snowflake as cohort and subgroup uplift tables. MLflow logs the run for tracking and reproducibility.
Tableau surfaces the results, filtering on validity flags so analysts only act on statistically trustworthy findings without manual triage.
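
The cohort-level step above can be illustrated with a minimal sketch of a percentile bootstrap for the retention difference. The function name, the 0/1 outcome encoding, the resample count, and the choice of the percentile method are all assumptions for illustration; the pipeline's actual bootstrap variant is not specified here.

```python
import random


def bootstrap_uplift_ci(test, control, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the retention-rate difference (test - control).

    `test` and `control` are lists of 0/1 retention outcomes, one per customer.
    Returns (point_estimate, ci_low, ci_high).
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same interval
    diffs = []
    for _ in range(n_boot):
        # Resample each arm independently, with replacement, at its own size.
        t = rng.choices(test, k=len(test))
        c = rng.choices(control, k=len(control))
        diffs.append(sum(t) / len(t) - sum(c) / len(c))
    diffs.sort()
    ci_low = diffs[int((alpha / 2) * n_boot)]
    ci_high = diffs[int((1 - alpha / 2) * n_boot) - 1]
    point = sum(test) / len(test) - sum(control) / len(control)
    return point, ci_low, ci_high
```

An interval that excludes zero suggests the campaign moved retention; an interval that straddles zero means the data cannot distinguish the effect from noise.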

Figure: Uplift modelling pipeline architecture (architecture_v1.svg)

Technical Challenges

Causal validity as a hard constraint. Uplift numbers are only meaningful if the experiment was properly randomized and adequately powered. The pipeline encodes all checks in a single validation step and writes the flags on every result row, so an underpowered or imbalanced experiment cannot be silently reported as a real effect.
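
As an illustration of how the four flags might be derived for one result row, here is a minimal sketch. The specific tests are assumptions, not the pipeline's confirmed method: a chi-square sample-ratio check for randomization, a pooled two-proportion z-test for significance, and a normal-approximation power calculation against a hypothetical minimum detectable effect `mde`. All thresholds and names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def validation_flags(n_t, x_t, n_c, x_c, *, expected_share=0.5, mde=0.05,
                     alpha=0.05, min_power=0.8, srm_alpha=0.001):
    """Return four boolean flags for one experiment result row.

    n_t, n_c: customers in test / control; x_t, x_c: retained customers.
    expected_share: planned share of customers in the test arm (assumption).
    mde: absolute uplift the experiment should be able to detect (assumption).
    """
    n = n_t + n_c
    p_t, p_c = x_t / n_t, x_c / n_c

    # Randomization: chi-square test (1 dof) for sample ratio mismatch.
    exp_t, exp_c = n * expected_share, n * (1 - expected_share)
    chi2 = (n_t - exp_t) ** 2 / exp_t + (n_c - exp_c) ** 2 / exp_c
    srm_p = math.erfc(math.sqrt(chi2 / 2))  # chi-square(1) survival function
    randomized = srm_p >= srm_alpha

    # Significance: two-sided pooled two-proportion z-test.
    pooled = (x_t + x_c) / n
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_t + 1 / n_c))
    z = (p_t - p_c) / se_null if se_null > 0 else 0.0
    p_value = 2 * (1 - _STD_NORMAL.cdf(abs(z)))
    significant = p_value < alpha

    # Power: approximate probability of detecting an uplift of `mde`.
    p_alt = min(p_c + mde, 1.0)
    se_alt = math.sqrt(p_c * (1 - p_c) / n_c + p_alt * (1 - p_alt) / n_t)
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    power = _STD_NORMAL.cdf(mde / se_alt - z_crit) if se_alt > 0 else 1.0
    powered = power >= min_power

    return {
        "is_randomized": randomized,
        "is_powered": powered,
        "is_significant": significant,
        "is_valid": randomized and powered and significant,
    }
```

Keeping the combined flag as a plain conjunction of the other three means a dashboard filter on `is_valid` alone is enough to hide untrustworthy rows.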

Subgroup analysis at scale. One logistic regression runs per subgroup per month per experiment, with the treatment indicator interacted with the subgroup feature. A config-driven design controls feature selection and binning, keeping the regression count manageable as experiments grow across 5 lotteries.
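
A minimal sketch of what a config-driven design could look like: the config lists which pre-treatment features to slice on and how to bin them, and the regression jobs are enumerated from it. The config keys, feature names, column names, and the reading of "one regression per slice" as a 0/1 slice indicator interacted with treatment are all assumptions. The formula strings use the patsy-style syntax accepted by formula-based model fitters such as statsmodels, where `a * b` expands to `a + b + a:b`; the sketch only builds the specs and does not fit any model.

```python
# Hypothetical config: numeric features carry bin edges, categorical ones carry levels.
CONFIG = {
    "outcome": "retained",
    "treatment": "treated",
    "features": {
        "tenure_months": {"bins": [0, 12, 36, 120]},
        "channel": {"levels": ["online", "retail"]},
    },
}


def slices(opts):
    """Return the slice labels for one feature: half-open bins or listed levels."""
    if "bins" in opts:
        edges = opts["bins"]
        return [f"[{lo}, {hi})" for lo, hi in zip(edges, edges[1:])]
    return list(opts["levels"])


def regression_specs(config, experiment_ids):
    """Enumerate one regression spec per experiment, feature, and slice."""
    # `in_slice` is a hypothetical 0/1 column marking membership of the slice;
    # the coefficient on treated:in_slice is the differential effect.
    formula = f'{config["outcome"]} ~ {config["treatment"]} * in_slice'
    specs = []
    for exp_id in experiment_ids:
        for feature, opts in config["features"].items():
            for label in slices(opts):
                specs.append({
                    "experiment": exp_id,
                    "feature": feature,
                    "slice": label,
                    "formula": formula,
                })
    return specs
```

Because the job list is a pure function of the config and the experiment ids, the regression count can be read off before anything runs: here it is experiments times the total number of slices, and coarsening the bins shrinks it directly.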

Post-treatment subgrouping is unsupported by design. Slicing on outcomes that only occur after treatment, such as gift acceptance, introduces selection bias and would invalidate causal claims. The pipeline deliberately only supports pre-treatment feature splits.
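
One way such a restriction could be enforced is an allow-list checked before any regression is scheduled. This is a sketch under assumptions: the allow-list contents and function name are hypothetical, and the pipeline may enforce the rule differently.

```python
# Hypothetical allow-list of features known to be fixed before treatment assignment.
PRE_TREATMENT_FEATURES = frozenset({"tenure_months", "channel", "age_band"})


def check_pre_treatment(requested, allowed=PRE_TREATMENT_FEATURES):
    """Return the requested features unchanged, or raise if any is not allow-listed.

    Anything outside the allow-list is rejected, which covers both post-treatment
    outcomes (for example gift acceptance) and simple typos in the config.
    """
    rejected = sorted(set(requested) - set(allowed))
    if rejected:
        raise ValueError(f"not a pre-treatment subgroup feature: {rejected}")
    return list(requested)
```

An allow-list fails closed: a newly added column cannot be used for subgrouping until someone has confirmed it is measured before treatment.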

Status

Running monthly across all 5 lotteries
Cohort-level and subgroup-level uplift results written to Snowflake and surfaced in Tableau
Dashboard live in preliminary form; deeper filtering and experiment tagging planned
Known limitation: some lotteries may require manual overrides for broken control group assignments

Next Steps

Richer segmentation and personalization: as uplift signals mature across segments, feed them into the Content Personalization Engine to move from indicative measurement to differentiated campaign content based on what actually drives retention per segment
Broader intervention types: experiment with a wider range of incentives including discounts, offers, and communication formats to build a richer evidence base of what works per segment
From indicative to prescriptive: currently uplift estimates are indicative and analysts translate them into campaign recommendations for marketers; the goal is to move toward a system that directly recommends the optimal intervention per customer segment without manual mediation
Tableau redesign with deep linking per experiment, KPI rollups, and uplift confidence flags
Automated validation logging per experiment so flag history is preserved across runs
ROI calculator integrating uplift estimates with campaign cost and ticket value
Non-randomized design support using inverse propensity weights for experiments where a clean control group was not maintained
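
For the planned non-randomized support, a minimal sketch of an inverse-propensity-weighted uplift estimate is given below. It assumes a propensity score (probability of receiving treatment given pre-treatment features) has already been estimated per customer by some separate model; the normalized (Hajek-style) weighting and the clipping threshold are illustrative choices, not a committed design.

```python
def ipw_uplift(outcomes, treated, propensity, clip=0.01):
    """Normalized inverse-propensity-weighted difference in retention.

    outcomes: 0/1 retention per customer; treated: 0/1 treatment indicator;
    propensity: estimated P(treated = 1 | features) per customer (assumed given).
    Propensities are clipped to [clip, 1 - clip] to avoid extreme weights.
    """
    num_t = den_t = num_c = den_c = 0.0
    for y, t, e in zip(outcomes, treated, propensity):
        e = min(max(e, clip), 1 - clip)
        if t:
            w = 1.0 / e            # treated customers weighted by 1 / e
            num_t += w * y
            den_t += w
        else:
            w = 1.0 / (1.0 - e)    # control customers weighted by 1 / (1 - e)
            num_c += w * y
            den_c += w
    if den_t == 0 or den_c == 0:
        raise ValueError("need at least one treated and one control customer")
    return num_t / den_t - num_c / den_c
```

With a constant propensity the weights cancel and the estimate reduces to the plain difference in retention rates, which is a useful sanity check against the randomized cohort-level results.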