AWS Bedrock · LLM Pipeline · Production
Customer Service LLM Summarization
Event-driven pipeline that processes 1.3M customer service interactions per year and turns them into structured features for downstream ML models. The system runs on AWS Bedrock and has been in production for 18 months across 5 countries.
- Interactions / Year: 1.3M
- Success Rate: 99.7%
- Avg Latency: 1.2s
- Cost per Call: $0.003
The problem
Customer service interactions were a missing signal in our churn models. We had 1.3M of them per year across 5 countries, but they were raw text — useful only if you could extract the right features at scale. Manual labeling wasn't viable, and training a classification model required labeled data we didn't have.
The challenge was doing this cheaply. A naive Bedrock setup would have cost ~$4,000/month just in API calls. Multi-language support added complexity — we needed consistent output structure across Dutch, French, Spanish, Polish, and a few others, without running separate fine-tuned models per country.
Design constraints that shaped the solution:
- ~3,600 interactions per day, latency-tolerant (minutes, not ms)
- Output must be machine-readable for feature pipelines downstream
- Schema violations can't silently corrupt model training data
- Infrastructure needs to be maintainable by one engineer
Architecture
EventBridge triggers Lambda whenever a new interaction lands in S3. Lambda batches the requests, calls Bedrock, and runs the output through a schema validator before writing to the SageMaker feature store. Failed events go to a dead letter queue for inspection.
The key design decision was treating this as a batch pipeline rather than real-time — which unlocked batching and prompt caching. The business didn't need these features within seconds, so optimising for throughput and cost rather than latency was the right call.
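A minimal sketch of the Lambda wiring, assuming the S3-via-EventBridge event shape and hypothetical helpers (validate, write_features) for the validation and feature-store steps; a single-object handler is shown to keep the sketch short, while the real pipeline batches several interactions per invocation:

import json
import boto3

s3 = boto3.client('s3')

def handler(event, context):
    # S3 "Object Created" events delivered via EventBridge carry bucket/key in event['detail']
    bucket = event['detail']['bucket']['name']
    key = event['detail']['object']['key']

    raw = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
    interaction = json.loads(raw)

    features = extract_features(interaction['text'])   # Bedrock call (see client below)
    validate(features)                                  # schema check before the feature store
    write_features(interaction['id'], features)         # hypothetical Feature Store ingest

    # Any exception propagates, so a failed event follows the retry/DLQ path
    return {'status': 'ok', 'interaction_id': interaction['id']}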
Implementation details
Prompt caching strategy
The big cost lever. Bedrock supports prompt caching on the system prompt portion — if that part is identical across calls, you only pay to process it once. I structured prompts so the instruction block is fixed (and cached) and only the interaction text varies. This alone cut costs by 40% and is the reason we hit $0.003/call at scale.
import json

import boto3


class BedrockClient:
    def __init__(self):
        self.client = boto3.client('bedrock-runtime')
        self.model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

    def extract_features(self, interaction_text: str) -> dict:
        # Fixed instruction block first (identical across calls), interaction text last
        prompt = f"""Extract structured info from this customer interaction:
- summary (max 200 chars)
- sentiment: positive/neutral/negative
- topic
- intent
- resolution_status: resolved/escalated/pending

Interaction: {interaction_text}

Return as JSON."""
        response = self.client.invoke_model(
            modelId=self.model_id,
            body=json.dumps({
                "anthropic_version": "bedrock-2023-05-31",
                "max_tokens": 500,
                "temperature": 0.1,
                "messages": [{"role": "user", "content": prompt}]
            })
        )
        # The response body wraps the model output; the extracted features are
        # the JSON text inside the first content block
        body = json.loads(response['body'].read())
        return json.loads(body['content'][0]['text'])
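The client above sends the instruction block inline with each request. To mark that block as an explicitly cacheable prefix, the Anthropic-on-Bedrock request body can carry it in the system field with a cache_control marker. A minimal sketch, assuming the model version in use supports Bedrock prompt caching (neither the field nor the split appears in the code above):

import json

# Sketch only: cache_control is Anthropic's explicit prompt-cache flag on Bedrock;
# whether a given model version honours it is an assumption here.
INSTRUCTION_BLOCK = """Extract structured info from this customer interaction:
- summary (max 200 chars)
- sentiment: positive/neutral/negative
- topic
- intent
- resolution_status: resolved/escalated/pending
Return as JSON."""

def build_request_body(interaction_text: str) -> str:
    return json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 500,
        "temperature": 0.1,
        "system": [{
            "type": "text",
            "text": INSTRUCTION_BLOCK,                  # identical on every call
            "cache_control": {"type": "ephemeral"},     # mark the prefix as cacheable
        }],
        "messages": [
            {"role": "user", "content": f"Interaction: {interaction_text}"}  # only this varies
        ],
    })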
Output schema & validation
The output parser is one of the more critical pieces — downstream models assume clean input, so any schema violation needs to be caught here, not silently passed through. The validator checks field presence, type, and value ranges before writing to the feature store.
{
"interaction_id": "uuid",
"summary": "string (max 200 chars)",
"sentiment": "positive|neutral|negative",
"primary_topic": "string",
"intent": "string",
"resolution_status": "resolved|escalated|pending",
"urgency_score": "float [0-1]",
"csat_indicator": "float [0-1]"
}
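A minimal sketch of those checks in plain Python against the schema above (ValidationError and the exact field handling are illustrative, not the production validator):

class ValidationError(ValueError):
    pass

ALLOWED_SENTIMENT = {'positive', 'neutral', 'negative'}
ALLOWED_STATUS = {'resolved', 'escalated', 'pending'}

def validate(features: dict) -> dict:
    # Presence and type checks for the string fields
    for field in ('interaction_id', 'summary', 'primary_topic', 'intent'):
        if not isinstance(features.get(field), str):
            raise ValidationError(f"missing or non-string field: {field}")

    if len(features['summary']) > 200:
        raise ValidationError("summary exceeds 200 characters")

    # Enumerated values
    if features.get('sentiment') not in ALLOWED_SENTIMENT:
        raise ValidationError(f"invalid sentiment: {features.get('sentiment')}")
    if features.get('resolution_status') not in ALLOWED_STATUS:
        raise ValidationError(f"invalid resolution_status: {features.get('resolution_status')}")

    # Range checks on the score fields
    for field in ('urgency_score', 'csat_indicator'):
        value = features.get(field)
        if not isinstance(value, (int, float)) or not 0.0 <= value <= 1.0:
            raise ValidationError(f"{field} must be a float in [0, 1]")

    return features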
Results
Volume
- 1.3M interactions processed annually
- ~3,600 per day
- 5 countries
- 10+ languages
Performance
- Average latency: 1.2s (p95: 2.8s)
- Success rate: 99.7%
- Cost: $0.003 per interaction
- 40% cost reduction via caching
Downstream impact
- Features feed 5+ downstream models
- Churn model AUC improved 0.03
- 95% reduction in manual labeling
- Enabled personalization pipeline
Organisational
- Established LLM patterns for the team
- Reusable monitoring infrastructure
- Template used for 2 follow-on projects
What I'd do differently
I underestimated how much language affects extraction quality. Dutch irony and Spanish formality patterns produce different sentiment signals than English does, and I spent two months calibrating this per country after launch.
The DLQ was an afterthought initially. Made it a first-class citizen after a batch failure silently dropped 4 hours of data. Now anything that fails surfaces immediately.
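For reference, one way to make the DLQ path first-class: route failed async invocations to an SQS queue and alarm the moment it is non-empty. A sketch with placeholder names and ARNs, not the actual configuration:

import boto3

lambda_client = boto3.client('lambda')
cloudwatch = boto3.client('cloudwatch')

# Route failed async invocations to an SQS queue instead of silently dropping them
# (function name, queue name, and ARN are placeholders)
lambda_client.put_function_event_invoke_config(
    FunctionName='interaction-summarizer',
    MaximumRetryAttempts=2,
    DestinationConfig={
        'OnFailure': {'Destination': 'arn:aws:sqs:eu-west-1:123456789012:summarizer-dlq'}
    },
)

# Alarm as soon as anything lands in the DLQ, so failures surface immediately
cloudwatch.put_metric_alarm(
    AlarmName='summarizer-dlq-not-empty',
    Namespace='AWS/SQS',
    MetricName='ApproximateNumberOfMessagesVisible',
    Dimensions=[{'Name': 'QueueName', 'Value': 'summarizer-dlq'}],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator='GreaterThanThreshold',
)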
Low temperature (0.1) was the right call for structured extraction but meant the summaries were occasionally too terse. A small human eval corpus per language would have caught this earlier.