Runbook: LLM Provider Outage¶
Severity: P2-Medium (core learn/recall unaffected)
Impact¶
LLM-dependent operations fail with 503:
- POST /v1/evaluate — multi-evaluator scoring
- POST /v1/compose — pipeline decomposition
- POST /v1/evolve — prompt evolution
Unaffected (no LLM needed):
- POST /v1/learn / POST /v1/recall — pattern storage and retrieval
- GET /v1/health / GET /v1/metrics — monitoring
- POST /v1/aging / POST /v1/feedback/decay — maintenance
- All key management, governance, analytics endpoints
Diagnostics¶
# Check deep health for LLM status
curl -s http://localhost:8000/v1/health/deep | jq '.checks.llm'
# Check provider status pages
# OpenAI: https://status.openai.com
# Anthropic: https://status.anthropic.com
# Check logs for LLM errors
docker compose logs engramia-api --since 10m | grep -i "llm\|openai\|anthropic" | tail -20
Mitigation¶
Option 1: Wait for recovery (recommended)¶
Built-in retry logic (3 attempts with exponential backoff) handles transient failures. Most outages resolve within 15-30 minutes.
Option 2: Switch provider¶
If using OpenAI and Anthropic is available (or vice versa):
# Update .env
ENGRAMIA_LLM_PROVIDER=anthropic
ENGRAMIA_LLM_MODEL=claude-sonnet-4-6
# Restart
docker compose restart engramia-api
Option 3: Disable LLM features¶
# Set provider to none — LLM endpoints return 501 instead of timing out
ENGRAMIA_LLM_PROVIDER=none
docker compose restart engramia-api
Recovery¶
- LLM endpoints auto-recover when the provider returns
- No data loss during outage (patterns, embeddings, analytics unaffected)
- Async jobs that failed due to LLM outage can be retried via the jobs API