Model Performance Monitoring & Alerting System

2025 · Terraform · MLflow · S3 · AMP · SNS · GitLab · SageMaker Pipelines

Impact

  • 20+ models and 400+ pipelines across 5 countries tracked in a single MLflow instance; 12 stakeholders alerted to critical performance issues via Teams
  • MLflow has been running for over a year and alerting for six months; the platform is stable enough that the team now ships frequent incremental improvements instead of firefighting

Business Problem

With 20+ models and 400+ data pipelines running across 5 countries, there was no central source of truth for model metrics or artifacts. Each team tracked performance in isolation, which made it nearly impossible to compare models or catch degradation early. Issues surfaced only once business stakeholders noticed drops in campaign results, not while the engineering team could still act.

The missing piece was not alerting in isolation. It was consistent metrics, a shared artifact store across all pipelines, and a unified view of model health that any team could trust.

Solution Design

The core is a central MLflow tracking server running on SageMaker, with S3 as the artifact store and the full infrastructure provisioned via Terraform. Every pipeline logs its metrics to this server on every run. A SageMaker Pipelines evaluation step compares each run against both a hard threshold and a benchmark experiment run, so a model that clears the absolute minimum is still flagged if it has regressed against its own previous baseline.
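The two-stage check can be sketched as follows. This is a minimal illustration, not the production evaluation step: the names `GateConfig`, `Verdict`, and `evaluate_run`, the 5% relative-drop tolerance, and the mapping of a floor breach to "critical" and a benchmark regression to "warning" are all assumptions, and the metric is assumed to be higher-is-better.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    PASS = "pass"
    WARNING = "warning"    # above the floor, but regressed against the benchmark
    CRITICAL = "critical"  # below the hard threshold


@dataclass(frozen=True)
class GateConfig:
    hard_threshold: float            # absolute minimum for the metric (assumed value below)
    max_relative_drop: float = 0.05  # tolerated drop versus the benchmark run (assumed)


def evaluate_run(metric: float, benchmark: float, cfg: GateConfig) -> Verdict:
    """Check the absolute floor first, then regression against the benchmark."""
    if metric < cfg.hard_threshold:
        return Verdict.CRITICAL
    if benchmark > 0 and (benchmark - metric) / benchmark > cfg.max_relative_drop:
        return Verdict.WARNING
    return Verdict.PASS


cfg = GateConfig(hard_threshold=0.70)
print(evaluate_run(0.65, 0.80, cfg))  # Verdict.CRITICAL: below the floor
print(evaluate_run(0.74, 0.80, cfg))  # Verdict.WARNING: passes the floor, 7.5% below benchmark
print(evaluate_run(0.79, 0.80, cfg))  # Verdict.PASS
```

The second call is the case the design targets: the run passes the absolute minimum yet is still flagged because it fell behind its own baseline.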

When a threshold is breached, the evaluation step posts to Amazon Managed Service for Prometheus (AMP), where Alertmanager applies deduplication and throttling per model group and publishes to an SNS topic. Teams channels subscribe directly to the topic. With 20+ models emitting signals simultaneously, deduplication is what keeps the alerting channel usable.

Figure: MLOps monitoring & alerting architecture (architecture_v5.svg)

Technical Challenges

Infrastructure from scratch. The full stack had to be provisioned via Terraform: MLflow tracking server on SageMaker, S3 artifact store, AMP workspace with alertmanager.yml rules, and SNS topic. This meant managing IAM roles, VPC routing, and storage lifecycle policies as code, while keeping the setup reproducible across environments.

Multiple pipeline triggers. Batch scoring pipelines, scheduled retrains, and GitLab CI/CD jobs all needed to log metrics to the same tracking server. Integrating heterogeneous triggers without duplicating instrumentation logic required a thin shared logging client used across all pipeline types.
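One way such a thin shared client could look is sketched below. The class names (`PipelineLogger`, `MlflowSink`, `InMemorySink`), the `country/model` experiment naming, and the tag keys are hypothetical and not taken from the project; only the `mlflow` calls themselves are real API. The sink is injected so that the same client can be exercised without a tracking server.

```python
from typing import Dict, List, Protocol


class MetricSink(Protocol):
    def log(self, experiment: str, run_name: str,
            tags: Dict[str, str], metrics: Dict[str, float]) -> None: ...


class MlflowSink:
    """Writes one MLflow run per call. Requires the `mlflow` package."""

    def __init__(self, tracking_uri: str) -> None:
        self.tracking_uri = tracking_uri

    def log(self, experiment: str, run_name: str,
            tags: Dict[str, str], metrics: Dict[str, float]) -> None:
        import mlflow  # imported lazily so other sinks work without it

        mlflow.set_tracking_uri(self.tracking_uri)
        mlflow.set_experiment(experiment)
        with mlflow.start_run(run_name=run_name):
            mlflow.set_tags(tags)
            mlflow.log_metrics(metrics)


class InMemorySink:
    """Stand-in sink for local runs and tests."""

    def __init__(self) -> None:
        self.records: List[dict] = []

    def log(self, experiment: str, run_name: str,
            tags: Dict[str, str], metrics: Dict[str, float]) -> None:
        self.records.append({"experiment": experiment, "run_name": run_name,
                             "tags": tags, "metrics": metrics})


class PipelineLogger:
    """Single entry point shared by batch, retrain, and CI/CD pipelines."""

    def __init__(self, sink: MetricSink, model: str, country: str, trigger: str) -> None:
        self.sink = sink
        self.experiment = f"{country}/{model}"  # hypothetical naming scheme
        self.tags = {"model": model, "country": country, "trigger": trigger}

    def log_run(self, run_name: str, metrics: Dict[str, float]) -> None:
        self.sink.log(self.experiment, run_name, dict(self.tags), dict(metrics))


sink = InMemorySink()
PipelineLogger(sink, model="churn", country="de", trigger="gitlab-ci").log_run(
    "nightly", {"auc": 0.81})
print(sink.records[0]["experiment"], sink.records[0]["tags"]["trigger"])  # de/churn gitlab-ci
```

In a real pipeline the `InMemorySink` would be swapped for `MlflowSink(tracking_uri=...)`; each trigger type only differs in the `trigger` tag it passes.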

Alert fatigue. With 20+ models each emitting performance signals, naive alerting produces noise. Throttling and deduplication rules in Alertmanager ensure flapping signals fire once rather than repeatedly, and correlated issues across models are grouped rather than broadcast individually.
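The grouping idea can be illustrated in plain Python. This is a simplified model of the behaviour, not Alertmanager's implementation or the project's actual rules; the label names `alertname`, `model_group`, and `model`, and the sample alerts, are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]  # a flat label set describing one firing alert


def group_alerts(
    alerts: Iterable[Alert],
    group_by: Tuple[str, ...] = ("alertname", "model_group"),
) -> Dict[Tuple[str, ...], List[Alert]]:
    """Bucket alerts by the group_by labels, dropping exact duplicates."""
    groups: Dict[Tuple[str, ...], List[Alert]] = defaultdict(list)
    seen = set()
    for alert in alerts:
        fingerprint = tuple(sorted(alert.items()))
        if fingerprint in seen:  # identical label set fired again (e.g. flapping)
            continue
        seen.add(fingerprint)
        key = tuple(alert.get(label, "") for label in group_by)
        groups[key].append(alert)
    return dict(groups)


incoming = [
    {"alertname": "AucRegression", "model_group": "churn", "model": "churn-de"},
    {"alertname": "AucRegression", "model_group": "churn", "model": "churn-fr"},
    {"alertname": "AucRegression", "model_group": "churn", "model": "churn-de"},
    {"alertname": "AucRegression", "model_group": "uplift", "model": "uplift-de"},
]
for key, members in group_alerts(incoming).items():
    print(key, [a["model"] for a in members])
# ('AucRegression', 'churn') ['churn-de', 'churn-fr']
# ('AucRegression', 'uplift') ['uplift-de']
```

Four incoming signals collapse into two notifications, one per model group, and the repeated `churn-de` signal appears only once.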

Status

  • 20+ models monitored continuously across 5 countries
  • 400+ pipelines monitored with MLflow: training, tuning, and scoring pipelines tracked on model performance and feature drift; analytics automation pipelines tracked on data size, feature drift, and A/B test conversion and retention metrics
  • Throttled alerting in production: performance benchmark and hard threshold evaluation across all models, with at most one warning per day when a check is triggered and immediate notification for critical issues
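The throttling policy in the last bullet can be sketched as a small rate limiter. This is an illustration of the policy only; in the system described it is Alertmanager that enforces it. The class name `DailyThrottle`, the per-group keying, and the choice to let every critical alert bypass the throttle are assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict


class DailyThrottle:
    """At most one warning per group per window; critical alerts always pass."""

    def __init__(self, window: timedelta = timedelta(days=1)) -> None:
        self.window = window
        self._last_warning: Dict[str, datetime] = {}

    def should_notify(self, group: str, severity: str, now: datetime) -> bool:
        if severity == "critical":
            return True  # immediate notification, never throttled
        last = self._last_warning.get(group)
        if last is not None and now - last < self.window:
            return False  # a warning for this group already went out in the window
        self._last_warning[group] = now
        return True


throttle = DailyThrottle()
t0 = datetime(2025, 1, 1, 9, 0)
print(throttle.should_notify("churn", "warning", t0))                        # True
print(throttle.should_notify("churn", "warning", t0 + timedelta(hours=3)))   # False
print(throttle.should_notify("churn", "critical", t0 + timedelta(hours=4)))  # True
print(throttle.should_notify("churn", "warning", t0 + timedelta(days=1)))    # True
```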

Next Steps

  • Data drift: more granular feature-level drift detection that yields actionable signals rather than binary breach notifications
  • Shadow modelling: run challenger models in parallel automatically, compare against production, and trigger retraining when the challenger consistently outperforms
  • Downstream dashboards: surface model health and drift signals directly in business dashboards so country teams can act on trends without needing to interpret MLflow runs
  • Weekly performance digests: scheduled summary of model health across all models and countries, sent automatically to stakeholders without requiring them to pull reports manually