Runbook: High Error Rates¶
Severity: P2-Medium (P1 if sustained >5 min or >50% of requests)
Symptoms¶
- Elevated 5xx responses in Prometheus (
engramia_requests_total{status_code=~"5.."}) - Increased error-level logs in structured output
- Health check returning
degradedorerror
Diagnostics¶
1. Identify error source¶
# Check deep health
curl -s http://localhost:8000/v1/health/deep | jq .
# Check structured logs for error patterns
docker compose logs engramia-api --since 10m | grep '"level":"ERROR"' | jq -r '.message' | sort | uniq -c | sort -rn
2. Classify errors¶
| Pattern | Likely Cause | Action |
|---|---|---|
ProviderError: OpenAI |
LLM provider outage | See LLM Provider Outage |
StorageError |
Database connectivity | See database-recovery.md |
ValidationError |
Bad client input | Check API consumers; not a system issue |
QuotaExceededError |
Tenant hitting limits | Review quotas; increase if legitimate |
RateLimited |
Burst traffic | See rate-limit-tuning.md |
3. Check Prometheus metrics¶
# Error rate over 5 minutes
rate(engramia_requests_total{status_code=~"5.."}[5m])
/ rate(engramia_requests_total[5m])
# LLM call failures
rate(engramia_llm_call_duration_seconds_count{status="error"}[5m])
LLM Provider Outage¶
Impact: /evaluate, /compose, /evolve return 503. /learn, /recall, /health still work.
Immediate actions¶
- Check provider status page (status.openai.com or status.anthropic.com)
- If prolonged (>15 min), enable maintenance mode for LLM-dependent endpoints:
- Communicate to API consumers that evaluation/composition is temporarily unavailable
Recovery¶
- LLM endpoints auto-recover when provider returns; retry logic (3x with backoff) handles transient failures
- No manual intervention needed for recovery
Escalation¶
- P2 (elevated but <50%): Monitor for 15 min, investigate root cause
- P1 (>50% or all requests failing): Immediate investigation, consider rollback if recent deployment
- P0 (data corruption or security breach): Follow Incident Response Plan