Provider failover chain (Business+)¶
A failover chain is an ordered list of fallback credentials that the runtime tries when the primary credential's call fails with a transient error (5xx, timeout, network, rate limit). The chain is per-credential — your OpenAI credential can fall back to Anthropic, your Anthropic credential can fall back to a different OpenAI key, etc.
Tier: Business or Enterprise.
What triggers failover¶
Engramia classifies provider exceptions into two buckets:
| Bucket | Examples | Behaviour |
|---|---|---|
| Auth-class | AuthenticationError, BadRequestError, PermissionDeniedError, Gemini 4xx |
Fail fast — never failover. Surfaces rotation/revocation immediately. |
| Transient | 5xx, timeouts, network errors, rate limits | Failover — try next credential in chain. |
This split is deliberate. Failing over on an auth error would mask the signal that the tenant needs to re-validate their key, and could inadvertently widen access via a different credential's scopes.
Chain shape¶
- Max length: primary + 2 fallbacks = 3 chain steps total.
- Order matters: the runtime tries entries left-to-right.
- Same tenant only: cross-tenant references are rejected at the API
layer with a 422 (defence in depth —
tenant_credentials.idis a UUID, but we check the WHERE clause instead of trusting that). - Active only at PATCH time: a credential being revoked or marked invalid is rejected when you set the chain. Run-time degradation (a fallback turning inactive after the chain was set) is silently skipped — the chain becomes one shorter.
- No self-reference: 422
FAILOVER_CHAIN_INVALIDif the chain contains the credential's own id. - No duplicates: each id may appear at most once.
API¶
Read¶
The current chain is part of the credential's public view:
{
"id": "cred-abc123",
"provider": "openai",
...
"failover_chain": ["cred-anthropic-key", "cred-gemini-fallback"],
"updated_at": "2026-04-29T12:34:56.789Z"
}
The updated_at field is the ETag basis for the next write — capture
it before opening an edit form.
Write — full replace¶
PATCH /v1/credentials/{id}/failover-chain
Content-Type: application/json
If-Match: "2026-04-29T12:34:56.789Z"
{"failover_chain": ["cred-anthropic-key", "cred-gemini-fallback"]}
Semantics:
- The body replaces the entire chain. Send
[]to disable failover. If-Matchis mandatory — same lost-update protection as per-role routing.- Validation runs in this order:
- Body shape (max length, no duplicates, valid id format)
- Self-reference check (
{id}not in chain) - Per-entry tenant + active-status lookup
- Tier gate (only when chain is non-empty)
- ETag check
- Apply update + invalidate caches + audit log
Self-ref is checked before the tier gate so a tenant trying to set a structurally invalid chain gets the real reason (422), not a misleading "upgrade your plan" (402).
Permission¶
Only owner and admin roles
(permission string credentials:failover_chain:write).
Errors¶
| HTTP | error_code |
When |
|---|---|---|
402 |
ENTITLEMENT_REQUIRED |
Tier below Business with a non-empty chain |
403 |
FORBIDDEN |
Caller lacks credentials:failover_chain:write |
412 |
PRECONDITION_FAILED |
If-Match does not match current updated_at |
422 |
FAILOVER_CHAIN_INVALID |
Self-ref / unknown id / inactive credential |
422 |
VALIDATION_ERROR |
Bad shape (length, duplicates, id format) |
428 |
PRECONDITION_REQUIRED |
If-Match header missing |
404 |
CREDENTIAL_NOT_FOUND |
Credential id unknown for this tenant |
Combined with per-role routing¶
Each chain member resolves its own role_models map independently.
Example: primary OpenAI maps eval -> gpt-4.1-mini; the Anthropic
fallback maps eval -> claude-haiku-4-5. When OpenAI is down and the
runtime falls over, the call uses claude-haiku-4-5, not the Anthropic
credential's default_model.
This means set the same logical roles on every credential in the chain if you want consistent cost/quality after failover. Otherwise the fallback silently uses the credential-wide default.
Telemetry¶
engramia_llm_failover_total{fallback_position="1"} # primary failed, secondary used
engramia_llm_failover_total{fallback_position="2"} # primary + secondary failed
Use these counters to detect a primary provider outage. If
fallback_position="1" rate spikes for one tenant, their primary key is
having issues; if it spikes globally, the upstream provider is down.
Cache & invalidation¶
The chain is built once per (tenant_id, primary_provider, role) and
cached. PATCHing any credential in the chain invalidates the
tenant-level cache, so the next call rebuilds with the new chain shape.
Cross-credential references are tracked implicitly through the
tenant-scoped flush.
Downgrade behaviour¶
Same as per-role routing: the configured chain stays active after a downgrade, but you can no longer edit it until you upgrade. Empty-list clear is always allowed.
Related¶
- per-role-routing.md — per-role model mapping
- ../architecture/credentials.md — full BYOK architecture
- ../runbooks/llm-provider-outage.md — when failover does not save you