Skip to content

Deployment Runbook

Ops reference: Step-by-step procedures for deploying, verifying, and rolling back Engramia in production. For background on configuration options see Environment Variables. For backup procedures see Backup & Restore.


Prerequisites

System requirements

  • Docker 24+ and Docker Compose v2 (docker compose, not docker-compose)
  • Access to the production VM via SSH
  • DEPLOY_PATH on the VM contains docker-compose.prod.yml and .env

Required environment variables (.env)

Variable Purpose
ENGRAMIA_DATABASE_URL PostgreSQL connection string
ENGRAMIA_API_KEYS Comma-separated bearer tokens for API auth
OPENAI_API_KEY LLM + embedding provider
ENGRAMIA_AUTH_MODE db for production (multi-tenant)
ENGRAMIA_BOOTSTRAP_TOKEN One-time admin key — unset after first use

Optional but recommended:

Variable Purpose
STRIPE_SECRET_KEY Billing (Phase 6)
STRIPE_WEBHOOK_SECRET Stripe webhook verification
ENGRAMIA_METRICS=true Enable Prometheus /metrics endpoint
ENGRAMIA_METRICS_TOKEN Bearer token protecting /metrics
ENGRAMIA_JSON_LOGS=true Structured JSON log output

Database

PostgreSQL 15+ with the pgvector extension. The pgvector Docker service in docker-compose.prod.yml includes this. Run migrations once before first start:

docker compose -f docker-compose.prod.yml exec engramia-api alembic upgrade head

Standard deployment procedure

This is the procedure the CI/CD pipeline executes automatically on release. Run it manually if deploying a hotfix or if CI is unavailable.

# 1. SSH into the production VM
ssh deploy@<host>
cd /opt/engramia

# 2. Set the target version
export IMAGE_TAG=0.6.1   # without leading 'v'

# 3. (Optional) Enable maintenance mode — prevents traffic during migration
echo "ENGRAMIA_MAINTENANCE=true" >> .env
docker compose -f docker-compose.prod.yml up -d engramia-api

# 4. Pull the new image
echo "$GHCR_TOKEN" | docker login ghcr.io -u <actor> --password-stdin
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml pull engramia-api

# 5. Restart with the new image
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml up -d engramia-api caddy

# 6. Run database migrations
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml exec -T engramia-api alembic upgrade head

# 7. Remove maintenance mode
sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d engramia-api

# 8. Prune old images
docker image prune -f

Health check verification

Run these after every deploy before closing the deployment window.

BASE=https://api.engramia.dev

# Version matches the release tag
curl -sf $BASE/v1/version | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['app_version'])"

# Shallow health (always fast)
curl -sf $BASE/v1/health | python3 -m json.tool

# Deep health — probes storage, LLM, embeddings, Stripe, and migration version
curl -sf $BASE/v1/health/deep | python3 -m json.tool

Expected output for a healthy deployment:

{
  "status": "ok",
  "version": "0.6.1",
  "uptime_seconds": 12.4,
  "checks": {
    "storage": {"status": "ok", "latency_ms": 3.1},
    "llm": {"status": "ok", "latency_ms": 820.5},
    "embedding": {"status": "ok", "latency_ms": 210.3},
    "stripe": {"status": "ok", "latency_ms": 145.2},
    "migration": {"status": "ok", "latency_ms": 2.0}
  }
}

A "status": "degraded" response (HTTP 200) means at least one non-critical probe failed. A "status": "error" response (HTTP 503) means all probes failed — investigate before routing traffic.


Zero-downtime deployment notes

The production docker-compose.prod.yml runs a single API container. True zero-downtime requires either:

  • Kubernetes — use kubectl rollout with replica counts ≥ 2 (see deploy/k8s/engramia.yaml)
  • Docker Compose with Caddy — Caddy buffers connections; the restart gap is typically < 2 s, acceptable for most SLA targets

For migrations that add columns or tables (non-destructive), the old container can run against the new schema without errors. For migrations that drop columns, enable maintenance mode first (step 3 above).


Rollback procedure

Use this when a deploy causes regressions or the deep health check returns "error".

ssh deploy@<host>
cd /opt/engramia

# 1. Enter maintenance mode immediately
echo "ENGRAMIA_MAINTENANCE=true" >> .env
docker compose -f docker-compose.prod.yml up -d engramia-api

# 2. Take a database snapshot before touching migrations
docker compose -f docker-compose.prod.yml exec pgvector \
  pg_dump -U engramia engramia -Fc > pre_rollback_$(date +%Y%m%d%H%M).dump

# 3. Identify the previous working version (check release tags or git log)
export PREV_TAG=0.6.0

# 4. Downgrade migrations introduced by the bad version (if any)
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade -1

# 5. Update IMAGE_TAG and restart
sed -i "s/IMAGE_TAG=.*/IMAGE_TAG=$PREV_TAG/" .env   # if pinned in .env
# OR edit docker-compose.prod.yml to pin the image tag directly

IMAGE_TAG=$PREV_TAG docker compose -f docker-compose.prod.yml up -d engramia-api caddy

# 6. Verify the rollback succeeded
curl -sf https://api.engramia.dev/v1/version
curl -sf https://api.engramia.dev/v1/health/deep | python3 -m json.tool

# 7. Exit maintenance mode
sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d engramia-api

Database migration rollback

# Show current migration head
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic current

# List all revisions (newest first)
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic history --verbose

# Roll back one revision
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade -1

# Roll back to a specific revision ID
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade 012

# Roll back all the way to an empty database
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade base

Warning

Always take a pg_dump before running alembic downgrade. Some downgrade scripts drop columns or tables; data lost this way cannot be recovered without a backup.


Troubleshooting common issues

Container exits immediately after start

# Inspect the last 100 log lines
docker compose -f docker-compose.prod.yml logs --tail=100 engramia-api

Common causes:

Symptom Likely cause Fix
ENGRAMIA_DATABASE_URL not set Missing env var Add to .env
connection refused on startup pgvector not ready Wait for pg_isready, or add depends_on health check
alembic.util.exc.CommandError Migration conflict Run alembic current to check state
ModuleNotFoundError: stripe Missing optional dep Rebuild image or use pip install engramia[billing]

Deep health returns storage: error

# Check pgvector container is healthy
docker compose -f docker-compose.prod.yml ps pgvector

# Connect manually to verify
docker compose -f docker-compose.prod.yml exec pgvector \
  psql -U engramia -c "SELECT version();"

Deep health returns migration: error

The running container reports a different migration revision than the code expects. This usually means migrations were not applied after the last deploy.

docker compose -f docker-compose.prod.yml exec -T engramia-api alembic upgrade head

Deep health returns stripe: error

Stripe API is unreachable or the secret key is invalid.

# Verify STRIPE_SECRET_KEY is set and starts with sk_live_ or sk_test_
grep STRIPE_SECRET_KEY .env

# Test reachability from the VM
curl -sf https://api.stripe.com/v1/

API returns 503 for all endpoints

If ENGRAMIA_MAINTENANCE=true is set in .env, all endpoints except /v1/health and /v1/health/deep return 503. Remove the variable and restart:

sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d engramia-api

Rate limit errors (429) under load

The default rate limit is 60 req/min per IP for standard endpoints and 10 req/min for LLM-intensive endpoints. Adjust via:

ENGRAMIA_RATE_LIMIT_DEFAULT=120
ENGRAMIA_RATE_LIMIT_EXPENSIVE=20

Note: Rate limiting is per-process and in-memory. For multi-instance deployments, configure an external rate limiter upstream (API gateway, WAF).