International Churn Model Deployment

2023 · AWS SageMaker · GitLab CI/CD · Snowflake · XGBoost · MLflow

Impact

  • First deployment on the central ML platform, establishing the integration pattern from local data warehouses through S3 to Snowflake; the same architecture now runs 20+ models across the organization
  • Churn identification 40x better than random selection and 8x better than the best selection benchmark, enabling retention campaigns targeted at players genuinely at risk
  • A/B tests running weekly or monthly across multiple countries using these predictions to measure and improve retention campaign effectiveness
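The Impact figures compare the model's targeting to random selection. A common way to express that comparison is lift: the churn rate among the highest-scored players divided by the overall churn rate. The sketch below assumes "40x better than random" means lift in a top-scored fraction of players. The function name, the 1% cut-off, and the synthetic labels in the usage lines are illustrative assumptions, not the project's evaluation code or data.

```python
from typing import Sequence


def lift_at_fraction(scores: Sequence[float], labels: Sequence[int], fraction: float) -> float:
    """Churn rate in the top-scored `fraction` of players divided by the overall churn rate.
    A lift of 1.0 is what random selection achieves on average."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(scores)
    base_rate = sum(labels) / n
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no churners")
    k = max(1, int(round(n * fraction)))
    # Rank players by predicted churn probability, highest first
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k
    return top_rate / base_rate


# Synthetic example: 1,000 players, 15 churners (1.5%), 6 of them in the top-scored 10
scores = list(range(1000, 0, -1))
labels = [1] * 6 + [0] * 4 + [1] * 9 + [0] * 981
print(lift_at_fraction(scores, labels, 0.01))  # about 40
```

Reaching 40% churners in a 1% slice when the base rate is 1.5% is what a lift of about 40 means in practice.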

Business Problem

Across 5 lotteries in 4 countries, retention marketing was reactive: campaigns were sent without knowing which players were genuinely at risk of leaving. With a monthly churn rate of roughly 1.5% of the active base, the vast majority of outreach landed on players who would have stayed regardless, wasting budget and diluting campaign impact.

The harder problem was structural: each country ran its own data warehouse and its own marketing automation system, and there was no shared infrastructure for delivering a model at all. Building a separate solution for each country would have meant fragmented codebases, diverging model versions, and no scalable path forward.

Solution Design

A single XGBoost model runs on a central ML platform built on AWS and GitLab CI/CD, with Snowflake as the database layer for feature storage and score delivery. The model scores every active player daily with a churn probability for the next 30 days. Rather than scoring individual tickets, it works at customer level: a customer who cancels any of their active tickets counts as a churn event. This keeps the targeting logic consistent across all lotteries regardless of how many tickets a player holds.
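The customer-level churn definition can be sketched as a labelling step over ticket records. This is a minimal illustration, assuming each ticket carries a cancellation date (or none while active); the `Ticket` type and `customer_churn_labels` function are hypothetical, and only the 30-day horizon and the "any active ticket cancelled" rule come from the text above.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Dict, Iterable, Optional


@dataclass
class Ticket:
    customer_id: str
    ticket_id: str
    cancelled_on: Optional[date]  # None while the ticket is still active


def customer_churn_labels(tickets: Iterable[Ticket], snapshot: date,
                          horizon_days: int = 30) -> Dict[str, int]:
    """Return {customer_id: 1 or 0} for customers with at least one ticket active at
    `snapshot`. A customer is a churner (1) if ANY of those tickets is cancelled within
    the following `horizon_days` days, regardless of how many tickets they hold."""
    window_end = snapshot + timedelta(days=horizon_days)
    labels: Dict[str, int] = {}
    for t in tickets:
        if t.cancelled_on is not None and t.cancelled_on <= snapshot:
            continue  # cancelled before the snapshot: not part of the active base
        churned = t.cancelled_on is not None and t.cancelled_on <= window_end
        # One churned ticket is enough to flag the whole customer
        labels[t.customer_id] = max(labels.get(t.customer_id, 0), int(churned))
    return labels
```

A customer with one cancelled and one still-active ticket is labelled a churner, which matches the "any active ticket" rule rather than a per-ticket view.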

Training data covers 12 months of monthly snapshots with a deliberate sampling strategy: all churners are included in full, and non-churners are undersampled until churners make up the target fraction of the training set. This exposes the model to the full diversity of churn behavior rather than repeatedly learning from the same few positive cases.
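The sampling strategy above can be sketched as follows. This assumes the target fraction is applied per monthly snapshot (the text does not say whether it is per snapshot or overall), and the actual target value is not stated, so the 0.25 used in the check is purely illustrative. The function name and row layout are hypothetical.

```python
import random
from typing import Dict, List


def sample_training_set(snapshots: Dict[str, List[dict]], target_churn_fraction: float,
                        seed: int = 0) -> List[dict]:
    """Keep every churner from every monthly snapshot; undersample non-churners so that
    churners make up `target_churn_fraction` of each snapshot's sample."""
    if not 0 < target_churn_fraction < 1:
        raise ValueError("target_churn_fraction must be in (0, 1)")
    rng = random.Random(seed)  # fixed seed so the training set is reproducible
    sample: List[dict] = []
    for month in sorted(snapshots):
        rows = snapshots[month]
        churners = [r for r in rows if r["churn"] == 1]
        stayers = [r for r in rows if r["churn"] == 0]
        # Solve churners / (churners + n) = target for n, capped at what is available
        wanted = round(len(churners) * (1 - target_churn_fraction) / target_churn_fraction)
        n_stayers = min(wanted, len(stayers))
        sample.extend(churners)                        # all positives, every month
        sample.extend(rng.sample(stayers, n_stayers))  # random subset of negatives
    return sample
```

Sampling each month separately means every season contributes its own churners, which is the point of using 12 monthly snapshots instead of one.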

Predictions are written to a central Snowflake output table and distributed to local data warehouses per country. Each country ingests the scores into its own warehouse and connects them to its own marketing automation tooling. Adding a new country is a config change, not a code change.
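One way to make "adding a new country is a config change" concrete is a registry of per-country settings that the export step iterates over. Everything below is a hypothetical sketch: the table names, tool names, country keys, and the `build_export_jobs` helper are placeholders, not the project's real configuration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class CountryConfig:
    lotteries: Tuple[str, ...]  # lotteries whose scores this country receives
    target_table: str           # destination table in the country's local warehouse
    automation_tool: str        # downstream marketing automation system


# Hypothetical registry: all names are placeholders, not the real setup.
COUNTRY_CONFIGS: Dict[str, CountryConfig] = {
    "country_a": CountryConfig(("lottery_1", "lottery_2"), "dwh_a.churn_scores", "tool_a"),
    "country_b": CountryConfig(("lottery_3",), "dwh_b.churn_scores", "tool_b"),
}

CENTRAL_TABLE = "CENTRAL.CHURN_SCORES"  # hypothetical central Snowflake output table


def build_export_jobs(score_date: str, configs: Dict[str, CountryConfig]) -> List[dict]:
    """Build one export job per configured country; no country is named in the code itself."""
    return [
        {
            "country": country,
            "source": CENTRAL_TABLE,
            "score_date": score_date,
            "lotteries": list(cfg.lotteries),
            "target": cfg.target_table,
            "handoff": cfg.automation_tool,
        }
        for country, cfg in sorted(configs.items())
    ]
```

Onboarding another country in this design means adding one `CountryConfig` entry; `build_export_jobs` stays unchanged.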

Churn model deployment architecture

Technical Challenges

First platform deployment, no playbook. Every integration had to be designed from scratch: data contracts, Snowflake schema, SageMaker orchestration, and per-country output feedback loops. With no prior template to follow, this required sustained cross-functional alignment between local and international teams.

Class imbalance. At 1.5% monthly churn, a model that predicts "no churn" for everyone is 98.5% accurate but useless. Standard oversampling on a static snapshot doesn't capture how churn behavior shifts across seasons and market conditions. The sampling strategy had to expose the model to the full diversity of churn patterns across time, not just the most recent or most frequent cases.
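The 98.5% figure follows directly from the base rate. A small check with synthetic labels at the stated 1.5% rate shows the trivial "no churn for everyone" model scoring high accuracy while catching no churners at all; the helper name is illustrative.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Accuracy over all players, and recall on the churn class (share of churners caught)."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_pos = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    positives = sum(y_true)
    return correct / len(y_true), (true_pos / positives if positives else 0.0)


# 1,000 synthetic players at ~1.5% monthly churn: 15 churners
y_true = [1] * 15 + [0] * 985
always_stay = [0] * len(y_true)  # the trivial "no churn for everyone" model
acc, rec = accuracy_and_recall(y_true, always_stay)
print(f"accuracy={acc:.3f} recall={rec:.3f}")  # accuracy=0.985 recall=0.000
```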

Heterogeneous stacks per country. Each lottery delivered different schemas and used different marketing automation tools. A data contract standardized inputs; output integration required custom config per country. Feature availability differences (save desk, add-ons, prize strategies) directly explain the performance range across lotteries.
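A data contract of this kind can be sketched as a required core plus optional, country-dependent features. The column names and types below are hypothetical placeholders; only the feature families (save desk, add-ons, prize strategies) come from the text.

```python
from typing import Any, Dict, List

# Hypothetical contract: column names and types are illustrative placeholders.
REQUIRED_COLUMNS = {"customer_id": str, "lottery_id": str, "snapshot_date": str, "tenure_days": int}
# Country-dependent features: may be absent, but must have the right type when present.
OPTIONAL_COLUMNS = {"save_desk_contacts": int, "active_addons": int, "prize_strategy": str}


def contract_violations(row: Dict[str, Any]) -> List[str]:
    """Return human-readable violations for one input row; an empty list means it conforms."""
    problems: List[str] = []
    for col, expected in REQUIRED_COLUMNS.items():
        if row.get(col) is None:
            problems.append(f"missing required column: {col}")
        elif not isinstance(row[col], expected):
            problems.append(f"{col}: expected {expected.__name__}, got {type(row[col]).__name__}")
    for col, expected in OPTIONAL_COLUMNS.items():
        value = row.get(col)
        if value is not None and not isinstance(value, expected):
            problems.append(f"{col}: expected {expected.__name__}, got {type(value).__name__}")
    return problems
```

Splitting the contract this way lets a country without, say, save desk data still conform, which is also why feature availability, and with it model performance, varies by lottery.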

Daily scoring at production cadence. Scores must be fresh every day. SageMaker Pipelines run event-driven daily scoring, monthly training, and quarterly tuning, all scheduled within overnight windows.
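The three cadences can be summarised as a rule deciding which jobs share a given night's window. This sketch models only the calendar part: the actual pipelines are event-driven, and the choice of the first day of the month and quarter as trigger days is an assumption, as is the function name.

```python
from datetime import date
from typing import List

QUARTER_START_MONTHS = (1, 4, 7, 10)


def overnight_jobs(day: date) -> List[str]:
    """Jobs for the overnight run on `day`, in execution order.
    Assumed rule: retrain on the 1st of each month, tune on the 1st of each quarter."""
    jobs: List[str] = []
    if day.day == 1 and day.month in QUARTER_START_MONTHS:
        jobs.append("tune")   # quarterly hyperparameter search
    if day.day == 1:
        jobs.append("train")  # monthly retrain, using the latest tuned hyperparameters
    jobs.append("score")      # daily scoring of every active player
    return jobs
```

Ordering tune before train before score ensures that, on the heaviest nights, scoring still uses the freshest model, which is the constraint the overnight window has to fit.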

Status

  • In production across 5 lotteries in 4 countries, with daily scoring and monthly retraining
  • 20+ models now run on the platform integration pattern established by this project
  • A/B tests running weekly or monthly in each country to measure retention campaign effectiveness

Next Steps

Most improvements identified at launch have been applied in the years since: the temporal sampling strategy, parallel hyperparameter tuning, and additional feature sets per lottery. The remaining priorities are:

  • Automatic A/B test monitoring and retention uplift modelling: a monthly pipeline that measures the actual retention effect of every campaign using causal inference and validates randomization, power, and significance automatically; see the A/B Test Evaluation and Uplift Modelling Pipeline
  • Shadow modelling via the Model Performance Monitoring & Alerting System: run challenger models in parallel with the production champion, compare performance continuously, and trigger replacement when the challenger consistently outperforms; the monitoring infrastructure is already in place
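The "consistently outperforms" condition for shadow modelling could be expressed as a simple promotion rule over per-period evaluation metrics. This is a hypothetical sketch: the metric (for example lift or AUC per month), the number of periods, and the margin are assumptions, not settings from the monitoring system.

```python
from typing import Sequence


def should_promote(champion_scores: Sequence[float], challenger_scores: Sequence[float],
                   min_periods: int = 3, min_margin: float = 0.0) -> bool:
    """Promote the challenger only if it beat the champion by more than `min_margin`
    in each of the last `min_periods` evaluation periods (higher metric is better)."""
    if len(champion_scores) != len(challenger_scores):
        raise ValueError("both models must be evaluated on the same periods")
    if len(champion_scores) < min_periods:
        return False  # not enough history to call it consistent
    recent = zip(champion_scores[-min_periods:], challenger_scores[-min_periods:])
    return all(chal - champ > min_margin for champ, chal in recent)
```

Requiring a win in every recent period, rather than on average, guards against replacing the champion because of one unusually favourable month.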