Skip to content

Provider failover chain (Business+)

A failover chain is an ordered list of fallback credentials that the runtime tries when the primary credential's call fails with a transient error (5xx, timeout, network, rate limit). The chain is per-credential — your OpenAI credential can fall back to Anthropic, your Anthropic credential can fall back to a different OpenAI key, etc.

Tier: Business or Enterprise.

What triggers failover

Engramia classifies provider exceptions into two buckets:

Bucket Examples Behaviour
Auth-class AuthenticationError, BadRequestError, PermissionDeniedError, Gemini 4xx Fail fast — never failover. Surfaces rotation/revocation immediately.
Transient 5xx, timeouts, network errors, rate limits Failover — try next credential in chain.

This split is deliberate. Failing over on an auth error would mask the signal that the tenant needs to re-validate their key, and could inadvertently widen access via a different credential's scopes.

Chain shape

  • Max length: primary + 2 fallbacks = 3 chain steps total.
  • Order matters: the runtime tries entries left-to-right.
  • Same tenant only: cross-tenant references are rejected at the API layer with a 422 (defence in depth — tenant_credentials.id is a UUID, but we check the WHERE clause instead of trusting that).
  • Active only at PATCH time: a credential being revoked or marked invalid is rejected when you set the chain. Run-time degradation (a fallback turning inactive after the chain was set) is silently skipped — the chain becomes one shorter.
  • No self-reference: 422 FAILOVER_CHAIN_INVALID if the chain contains the credential's own id.
  • No duplicates: each id may appear at most once.

API

Read

The current chain is part of the credential's public view:

GET /v1/credentials/{id}
{
  "id": "cred-abc123",
  "provider": "openai",
  ...
  "failover_chain": ["cred-anthropic-key", "cred-gemini-fallback"],
  "updated_at": "2026-04-29T12:34:56.789Z"
}

The updated_at field is the ETag basis for the next write — capture it before opening an edit form.

Write — full replace

PATCH /v1/credentials/{id}/failover-chain
Content-Type: application/json
If-Match: "2026-04-29T12:34:56.789Z"

{"failover_chain": ["cred-anthropic-key", "cred-gemini-fallback"]}

Semantics:

  • The body replaces the entire chain. Send [] to disable failover.
  • If-Match is mandatory — same lost-update protection as per-role routing.
  • Validation runs in this order:
  • Body shape (max length, no duplicates, valid id format)
  • Self-reference check ({id} not in chain)
  • Per-entry tenant + active-status lookup
  • Tier gate (only when chain is non-empty)
  • ETag check
  • Apply update + invalidate caches + audit log

Self-ref is checked before the tier gate so a tenant trying to set a structurally invalid chain gets the real reason (422), not a misleading "upgrade your plan" (402).

Permission

Only owner and admin roles (permission string credentials:failover_chain:write).

Errors

HTTP error_code When
402 ENTITLEMENT_REQUIRED Tier below Business with a non-empty chain
403 FORBIDDEN Caller lacks credentials:failover_chain:write
412 PRECONDITION_FAILED If-Match does not match current updated_at
422 FAILOVER_CHAIN_INVALID Self-ref / unknown id / inactive credential
422 VALIDATION_ERROR Bad shape (length, duplicates, id format)
428 PRECONDITION_REQUIRED If-Match header missing
404 CREDENTIAL_NOT_FOUND Credential id unknown for this tenant

Combined with per-role routing

Each chain member resolves its own role_models map independently. Example: primary OpenAI maps eval -> gpt-4.1-mini; the Anthropic fallback maps eval -> claude-haiku-4-5. When OpenAI is down and the runtime falls over, the call uses claude-haiku-4-5, not the Anthropic credential's default_model.

This means set the same logical roles on every credential in the chain if you want consistent cost/quality after failover. Otherwise the fallback silently uses the credential-wide default.

Telemetry

engramia_llm_failover_total{fallback_position="1"}   # primary failed, secondary used
engramia_llm_failover_total{fallback_position="2"}   # primary + secondary failed

Use these counters to detect a primary provider outage. If fallback_position="1" rate spikes for one tenant, their primary key is having issues; if it spikes globally, the upstream provider is down.

Cache & invalidation

The chain is built once per (tenant_id, primary_provider, role) and cached. PATCHing any credential in the chain invalidates the tenant-level cache, so the next call rebuilds with the new chain shape. Cross-credential references are tracked implicitly through the tenant-scoped flush.

Downgrade behaviour

Same as per-role routing: the configured chain stays active after a downgrade, but you can no longer edit it until you upgrade. Empty-list clear is always allowed.