Customer Service Conversation Wrap-Up Automation Pipeline

2024 · AWS Bedrock · Lambda · Snowflake · MLflow · Python

Impact

  • 1.3M conversations per year processed automatically across inbound customer service
  • 10% reduction in agent call handling time, with wrap-up notes no longer written manually after each call
  • New and higher-quality insights on contact reason, sentiment, churn risk, and agent-customer dynamics, replacing inconsistent manual classifications

Business Problem

After every customer service call, agents manually wrote a wrap-up note and picked a topic category. Across thousands of interactions per day, this created two compounding problems: significant unproductive post-call time per agent, and inconsistent output that made downstream analytics unreliable. Two agents handling identical calls would classify them differently, corrupting any reporting built on CRM data.

The raw transcripts already contained everything needed to fill the wrap-up form. The challenge was extracting it reliably at scale with structured output that matched CRM field requirements, without sending unmasked personal data outside the customer service platform, and without disrupting the existing agent workflow.

Solution Design

Conversations are ingested in near real time. Each transcript is PII-masked before any model processing touches it, and raw conversations are never stored. An LLM extraction step then produces five summary texts (call reason, issue resolution, sentiment, churn indicators, and agent-customer dynamics) plus ten structured classification features, including main and sub call reason, sentiment trend, agent effort, resolution type, and churn indication.

An automated LLM evaluator scores every extraction for coherence, completeness, and schema validity. Results below threshold go to a dead-letter queue rather than silently corrupting CRM records. Validated output is written to Snowflake for downstream consumption and pushed back to the CRM for automated wrap-up. A sampled subset goes to human evaluators for ongoing quality calibration.
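
The quality gate described above can be sketched as a small routing function. This is a minimal illustration, not the production logic: the threshold value, the score range of 0 to 1, and the names EvaluatedExtraction and route are all assumptions made for the example.

```python
from dataclasses import dataclass

# Hypothetical threshold; the actual value is not stated in this write-up.
QUALITY_THRESHOLD = 0.8


@dataclass
class EvaluatedExtraction:
    """One extraction plus its evaluator scores (assumed to lie in [0, 1])."""
    conversation_id: str
    coherence: float
    completeness: float
    schema_valid: bool


def route(item: EvaluatedExtraction, threshold: float = QUALITY_THRESHOLD) -> str:
    """Return 'publish' for validated output, 'dead_letter' otherwise."""
    # A schema violation is always rejected, whatever the other scores are.
    if not item.schema_valid:
        return "dead_letter"
    # The weakest score decides, so one poor dimension is enough to reject.
    if min(item.coherence, item.completeness) < threshold:
        return "dead_letter"
    return "publish"
```

Gating on the minimum score rather than an average is a design choice for this sketch: it keeps a high coherence score from hiding an incomplete extraction.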

Figure: Wrap-up automation pipeline architecture (architecture_v5.svg)

Technical Challenges

PII masking as a hard boundary. All transcripts must be masked before any model call. The masking step runs in a dedicated Lambda before Bedrock is invoked, and the raw conversation never leaves the ingestion boundary. This is non-negotiable from a governance standpoint, but it adds latency that narrows the window available for the downstream CRM write.
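
The ordering constraint (mask first, then call the model) can be illustrated with a short sketch. The two regular expressions are deliberately simplistic stand-ins, and the function names mask_pii and handle_transcript are hypothetical. The write-up does not say which masking technique the pipeline uses, and a real masker would need to cover many more entity types, such as names, addresses, and account numbers.

```python
import re

# Illustrative patterns only; these are assumptions, not the pipeline's rules.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def mask_pii(text: str) -> str:
    """Replace e-mail addresses and phone-like digit runs with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


def handle_transcript(raw: str, invoke_model) -> str:
    """Mask the transcript, then pass only the masked text to the model call.

    `invoke_model` is a placeholder for whatever function wraps the model
    invocation; the raw text is never handed to it.
    """
    masked = mask_pii(raw)
    return invoke_model(masked)
```

Passing the model call in as a parameter makes the boundary easy to test: a stub can record exactly what text would have left the masking step.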

Output schema stability. The CRM accepts a fixed field schema with strict enum values. The LLM must return valid JSON with no hallucinated enum values, every time. Structured output mode, combined with a validation step that runs before the CRM writer, catches schema violations. A single bad write corrupts a CRM record, so there is no fallback to a partial or approximate write.
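
A validation step of this kind might look like the sketch below. The two fields and their enum values are invented for illustration, since the real CRM schema is not listed here. The validator rejects the whole record on any violation, matching the no-partial-write rule above.

```python
import json
from enum import Enum


# Hypothetical enums; the actual CRM field values are not given in the source.
class SentimentTrend(str, Enum):
    IMPROVING = "improving"
    STABLE = "stable"
    DECLINING = "declining"


class ResolutionType(str, Enum):
    RESOLVED = "resolved"
    ESCALATED = "escalated"
    UNRESOLVED = "unresolved"


FIELD_ENUMS = {"sentiment_trend": SentimentTrend, "resolution_type": ResolutionType}


def validate_extraction(raw_json: str) -> dict:
    """Parse model output; raise ValueError on any schema violation."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    # Exact key match: missing and unexpected fields are both rejected.
    if set(data) != set(FIELD_ENUMS):
        raise ValueError(f"unexpected field set: {sorted(data)}")
    for field, enum_cls in FIELD_ENUMS.items():
        value = data[field]
        allowed = [member.value for member in enum_cls]
        if not isinstance(value, str) or value not in allowed:
            raise ValueError(f"{field}: {value!r} is not one of {allowed}")
    return data
```

Because the function either returns the full record or raises, the caller can send any failure straight to the dead-letter queue without inspecting partial results.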

Reducing LLM evaluation variance. A single evaluator model is too unstable for a quality gate at this scale. The solution uses three different LLMs that score each output independently, with majority voting applied when their scores diverge. This ensemble runs weekly on a 10k-conversation sample rather than on every call. Summaries are scored on factuality, completeness, and helpfulness. Classifications get a quality score from 1 to 5 per feature. Prompt versions and all metrics are tracked in MLflow so that regressions are caught across prompt iterations, not just within a single run.
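
The voting step can be sketched as follows for the 1 to 5 classification scores. The function name aggregate_scores is hypothetical, and the tie-break is an assumption: the write-up does not say what happens when all three evaluators disagree, so this sketch falls back to the median.

```python
from collections import Counter
from statistics import median


def aggregate_scores(scores: list[int]) -> int:
    """Combine three independent evaluator scores into one.

    If at least two evaluators agree, their score wins (majority vote).
    If all three differ, the median is used; that fallback is an
    assumption of this sketch, not a documented rule.
    """
    if len(scores) != 3:
        raise ValueError("expected exactly three evaluator scores")
    value, count = Counter(scores).most_common(1)[0]
    if count >= 2:
        return value
    return int(median(scores))
```

For example, scores of 4, 4, and 5 aggregate to 4 by majority, while 2, 4, and 5 have no majority and fall back to the median, 4.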

Taxonomy alignment as a prerequisite. Each of the ten structured features needs an agreed definition before the model can be prompted reliably. Early iterations showed that ambiguous feature definitions produced inconsistent output regardless of model quality. Aligning Customer Service, Analytics, and ML Platform on a shared taxonomy turned out to be the hardest problem, harder than any of the technical work.

Status

  • Running at scale: 1.3M conversations per year across inbound customer service
  • Ten structured classification features and five summary texts produced per conversation
  • Automated LLM evaluation on every extraction; sampled human evaluation used for calibration
  • Output stored in Snowflake and written back to the CRM for automated wrap-up

Next Steps

  • Fine-tune classification models on domain-specific conversation data to improve accuracy on business-critical labels such as main call reason and churn indication
  • Fine-tune the evaluation model to reduce calibration overhead and improve consistency across languages
  • Scale to other countries and outbound conversations, which involve different CRM integrations and conversation structures
  • Expand downstream use: product improvement signals, new topic classifications, and deeper integration with customer service analytics and reporting dashboards