AWS Bedrock · LLM Pipeline · Production
Customer Service LLM Summarization
Event-driven pipeline that processes 1.3M customer service interactions per year and turns them into structured features for downstream ML models. The system runs on AWS Bedrock and has been in production for 18 months across 5 countries.
- Interactions / Year: 1.3M
- Success Rate: 99.7%
- Avg Latency: 1.2s
- Cost per Call: $0.003
The problem
Customer service interactions were a missing signal in our churn models. We had 1.3M of them per year across 5 countries, but they were raw text — useful only if you could extract the right features at scale. Manual labeling wasn't viable, and training a classification model required labeled data we didn't have.
The challenge was doing this cheaply. A naive Bedrock setup would have cost ~$4,000/month just in API calls. Multi-language support added complexity — we needed consistent output structure across Dutch, French, Spanish, Polish, and a few others, without running separate fine-tuned models per country.
Design constraints that shaped the solution:
- ~3,600 interactions per day, latency-tolerant (minutes, not ms)
- Output must be machine-readable for feature pipelines downstream
- Schema violations can't silently corrupt model training data
- Infrastructure needs to be maintainable by one engineer
Architecture
EventBridge triggers Lambda whenever a new interaction lands in S3. Lambda batches the requests, calls Bedrock, and runs the output through a schema validator before writing to the SageMaker feature store. Failed events go to a dead letter queue for inspection.
The key design decision was treating this as a batch pipeline rather than real-time — which unlocked batching and prompt caching. The business didn't need these features within seconds, so optimising for throughput and cost rather than latency was the right call.
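A minimal sketch of the Lambda wiring, assuming the S3-via-EventBridge event shape and hypothetical helpers (validate, write_features) for the validation and feature-store steps; a single-object handler is shown to keep the sketch short, while the real pipeline batches several interactions per invocation:

import json
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # S3 "Object Created" events delivered via EventBridge carry bucket/key in event['detail']
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    raw = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    interaction = json.loads(raw)

    features = extract_features(interaction['text'])   # Bedrock call (see client below)
    validate(features)                                  # schema check before the feature store
    write_features(interaction['id'], features)         # hypothetical Feature Store ingest

    # Any exception propagates, so a failed event follows the retry/DLQ path
    return {'status': 'ok', 'interaction_id': interaction['id']}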
Implementation details
Prompt caching strategy
The big cost lever. Bedrock supports prompt caching on the system prompt portion — if that part is identical across calls, you only pay to process it once. I structured prompts so the instruction block is fixed (and cached) and only the interaction text varies. This alone cut costs by 40% and is the reason we hit $0.003/call at scale.
import json

import boto3


class BedrockClient:
    def __init__(self):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

    def extract_features(self, interaction_text: str) -> dict:
        # Fixed instruction block first (identical across calls), interaction text last
        prompt = f"""Extract structured info from this customer interaction:
- summary (max 200 chars)
- sentiment: positive/neutral/negative
- topic
- intent
- resolution_status: resolved/escalated/pending

Interaction: {interaction_text}

Return as JSON."""
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 500,
                "temperature": 0.1,
                "messages": [{"role": "user", "content": prompt}]
            })
        )
        # The response body wraps the model output; the extracted features are
        # the JSON text inside the first content block
        body = json.loads(response['body'].read())
        return json.loads(body['content'][0]['text'])
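The client above sends the instruction block inline with each request. To mark that block as an explicitly cacheable prefix, the Anthropic-on-Bedrock request body can carry it in the system field with a cache_control marker. A minimal sketch, assuming the model version in use supports Bedrock prompt caching (neither the field nor the split appears in the code above):

import json

# Sketch only: cache_control is Anthropic's explicit prompt-cache flag on Bedrock;
# whether a given model version honours it is an assumption here.
INSTRUCTION_BLOCK = """Extract structured info from this customer interaction:
- summary (max 200 chars)
- sentiment: positive/neutral/negative
- topic
- intent
- resolution_status: resolved/escalated/pending
Return as JSON."""

def build_request_body(interaction_text: str) -> str:
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "temperature": 0.1,
        "system": [{
            "type": "text",
            "text": INSTRUCTION_BLOCK,                  # identical on every call
            "cache_control": {"type": "ephemeral"},     # mark the prefix as cacheable
        }],
        "messages": [
            {"role": "user", "content": f"Interaction: {interaction_text}"}  # only this varies
        ],
    })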
Output schema & validation
The output parser is one of the more critical pieces — downstream models assume clean input, so any schema violation needs to be caught here, not silently passed through. The validator checks field presence, type, and value ranges before writing to the feature store.
{
"interaction_id": "uuid",
"summary": "string (max 200 chars)",
"sentiment": "positive|neutral|negative",
"primary_topic": "string",
"intent": "string",
"resolution_status": "resolved|escalated|pending",
"urgency_score": "float [0-1]",
"csat_indicator": "float [0-1]"
}
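A minimal sketch of those checks in plain Python against the schema above (ValidationError and the exact field handling are illustrative, not the production validator):

class ValidationError(ValueError):
    pass

ALLOWED_SENTIMENT = {'positive', 'neutral', 'negative'}
ALLOWED_STATUS = {'resolved', 'escalated', 'pending'}

def validate(features: dict) -> dict:
    # Presence and type checks for the string fields
    for field in ('interaction_id', 'summary', 'primary_topic', 'intent'):
        if not isinstance(features.get(field), str):
            raise ValidationError(f"missing or non-string field: {field}")

    if len(features['summary']) > 200:
        raise ValidationError("summary exceeds 200 characters")

    # Enumerated values
    if features.get('sentiment') not in ALLOWED_SENTIMENT:
        raise ValidationError(f"invalid sentiment: {features.get('sentiment')}")
    if features.get('resolution_status') not in ALLOWED_STATUS:
        raise ValidationError(f"invalid resolution_status: {features.get('resolution_status')}")

    # Range checks on the score fields
    for field in ('urgency_score', 'csat_indicator'):
        value = features.get(field)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            raise ValidationError(f"{field} must be a float in [0, 1]")

    return features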
Results
Volume
- 1.3M interactions processed annually
- ~3,600 per day
- 5 countries
- 10+ languages
Performance
- Average latency: 1.2s (p95: 2.8s)
- Success rate: 99.7%
- Cost: $0.003 per interaction
- 40% cost reduction via caching
Downstream impact
- Features feed 5+ downstream models
- Churn model AUC improved 0.03
- 95% reduction in manual labeling
- Enabled personalization pipeline
Organisational
- Established LLM patterns for the team
- Reusable monitoring infrastructure
- Template used for 2 follow-on projects
What I'd do differently
I underestimated how much language affects extraction quality. Dutch irony and Spanish formality patterns produce different sentiment signals than English does, and I spent two months calibrating this per country after launch.
The DLQ was an afterthought initially. Made it a first-class citizen after a batch failure silently dropped 4 hours of data. Now anything that fails surfaces immediately.
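For reference, one way to make the DLQ path first-class: route failed async invocations to an SQS queue and alarm the moment it is non-empty. A sketch with placeholder names and ARNs, not the actual configuration:

import boto3

lambda_client = boto3.client('lambda')
cloudwatch = boto3.client('cloudwatch')

# Route failed async invocations to an SQS queue instead of silently dropping them
# (function name, queue name, and ARN are placeholders)
lambda_client.put_function_event_invoke_config(
    FunctionName='interaction-summarizer',
    MaximumRetryAttempts=2,
    DestinationConfig={
        'OnFailure': {'Destination': 'arn:aws:sqs:eu-west-1:123456789012:summarizer-dlq'}
    },
)

# Alarm as soon as anything lands in the DLQ, so failures surface immediately
cloudwatch.put_metric_alarm(
    AlarmName='summarizer-dlq-not-empty',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'summarizer-dlq'}],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
)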
Low temperature (0.1) was the right call for structured extraction but meant the summaries were occasionally too terse. A small human eval corpus per language would have caught this earlier.