Deployment Runbook¶
Ops reference: Step-by-step procedures for deploying, verifying, and rolling back Engramia in production. For background on configuration options see Environment Variables. For backup procedures see Backup & Restore.
Prerequisites¶
System requirements¶
- Docker 24+ and Docker Compose v2 (
docker compose, notdocker-compose) - Access to the production VM via SSH
DEPLOY_PATHon the VM containsdocker-compose.prod.ymland.env
Required environment variables (.env)¶
| Variable | Purpose |
|---|---|
ENGRAMIA_DATABASE_URL |
PostgreSQL connection string |
ENGRAMIA_API_KEYS |
Comma-separated bearer tokens for API auth |
OPENAI_API_KEY |
LLM + embedding provider |
ENGRAMIA_AUTH_MODE |
db for production (multi-tenant) |
ENGRAMIA_BOOTSTRAP_TOKEN |
One-time admin key — unset after first use |
Optional but recommended:
| Variable | Purpose |
|---|---|
STRIPE_SECRET_KEY |
Billing (Phase 6) |
STRIPE_WEBHOOK_SECRET |
Stripe webhook verification |
ENGRAMIA_METRICS=true |
Enable Prometheus /metrics endpoint |
ENGRAMIA_METRICS_TOKEN |
Bearer token protecting /metrics |
ENGRAMIA_JSON_LOGS=true |
Structured JSON log output |
Database¶
PostgreSQL 15+ with the pgvector extension. The pgvector Docker service in
docker-compose.prod.yml includes this. Run migrations once before first start:
Standard deployment procedure¶
This is the procedure the CI/CD pipeline executes automatically on release. Run it manually if deploying a hotfix or if CI is unavailable.
# 1. SSH into the production VM
ssh deploy@<host>
cd /opt/engramia
# 2. Set the target version
export IMAGE_TAG=0.6.1 # without leading 'v'
# 3. (Optional) Enable maintenance mode — prevents traffic during migration
echo "ENGRAMIA_MAINTENANCE=true" >> .env
docker compose -f docker-compose.prod.yml up -d engramia-api
# 4. Pull the new image
echo "$GHCR_TOKEN" | docker login ghcr.io -u <actor> --password-stdin
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml pull engramia-api
# 5. Restart with the new image
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml up -d engramia-api caddy
# 6. Run database migrations
IMAGE_TAG=$IMAGE_TAG docker compose -f docker-compose.prod.yml exec -T engramia-api alembic upgrade head
# 7. Remove maintenance mode
sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d engramia-api
# 8. Prune old images
docker image prune -f
Health check verification¶
Run these after every deploy before closing the deployment window.
BASE=https://api.engramia.dev
# Version matches the release tag
curl -sf $BASE/v1/version | python3 -c "import sys,json; d=json.load(sys.stdin); print(d['app_version'])"
# Shallow health (always fast)
curl -sf $BASE/v1/health | python3 -m json.tool
# Deep health — probes storage, LLM, embeddings, Stripe, and migration version
curl -sf $BASE/v1/health/deep | python3 -m json.tool
Expected output for a healthy deployment:
{
"status": "ok",
"version": "0.6.1",
"uptime_seconds": 12.4,
"checks": {
"storage": {"status": "ok", "latency_ms": 3.1},
"llm": {"status": "ok", "latency_ms": 820.5},
"embedding": {"status": "ok", "latency_ms": 210.3},
"stripe": {"status": "ok", "latency_ms": 145.2},
"migration": {"status": "ok", "latency_ms": 2.0}
}
}
A "status": "degraded" response (HTTP 200) means at least one non-critical
probe failed. A "status": "error" response (HTTP 503) means all probes failed
— investigate before routing traffic.
Zero-downtime deployment notes¶
The production docker-compose.prod.yml runs a single API container. True
zero-downtime requires either:
- Kubernetes — use
kubectl rolloutwith replica counts ≥ 2 (seedeploy/k8s/engramia.yaml) - Docker Compose with Caddy — Caddy buffers connections; the restart gap is typically < 2 s, acceptable for most SLA targets
For migrations that add columns or tables (non-destructive), the old container can run against the new schema without errors. For migrations that drop columns, enable maintenance mode first (step 3 above).
Rollback procedure¶
Use this when a deploy causes regressions or the deep health check returns
"error".
ssh deploy@<host>
cd /opt/engramia
# 1. Enter maintenance mode immediately
echo "ENGRAMIA_MAINTENANCE=true" >> .env
docker compose -f docker-compose.prod.yml up -d engramia-api
# 2. Take a database snapshot before touching migrations
docker compose -f docker-compose.prod.yml exec pgvector \
pg_dump -U engramia engramia -Fc > pre_rollback_$(date +%Y%m%d%H%M).dump
# 3. Identify the previous working version (check release tags or git log)
export PREV_TAG=0.6.0
# 4. Downgrade migrations introduced by the bad version (if any)
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade -1
# 5. Update IMAGE_TAG and restart
sed -i "s/IMAGE_TAG=.*/IMAGE_TAG=$PREV_TAG/" .env # if pinned in .env
# OR edit docker-compose.prod.yml to pin the image tag directly
IMAGE_TAG=$PREV_TAG docker compose -f docker-compose.prod.yml up -d engramia-api caddy
# 6. Verify the rollback succeeded
curl -sf https://api.engramia.dev/v1/version
curl -sf https://api.engramia.dev/v1/health/deep | python3 -m json.tool
# 7. Exit maintenance mode
sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d engramia-api
Database migration rollback¶
# Show current migration head
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic current
# List all revisions (newest first)
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic history --verbose
# Roll back one revision
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade -1
# Roll back to a specific revision ID
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade 012
# Roll back all the way to an empty database
docker compose -f docker-compose.prod.yml exec -T engramia-api alembic downgrade base
Warning
Always take a pg_dump before running alembic downgrade. Some downgrade
scripts drop columns or tables; data lost this way cannot be recovered
without a backup.
Troubleshooting common issues¶
Container exits immediately after start¶
# Inspect the last 100 log lines
docker compose -f docker-compose.prod.yml logs --tail=100 engramia-api
Common causes:
| Symptom | Likely cause | Fix |
|---|---|---|
ENGRAMIA_DATABASE_URL not set |
Missing env var | Add to .env |
connection refused on startup |
pgvector not ready | Wait for pg_isready, or add depends_on health check |
alembic.util.exc.CommandError |
Migration conflict | Run alembic current to check state |
ModuleNotFoundError: stripe |
Missing optional dep | Rebuild image or use pip install engramia[billing] |
Deep health returns storage: error¶
# Check pgvector container is healthy
docker compose -f docker-compose.prod.yml ps pgvector
# Connect manually to verify
docker compose -f docker-compose.prod.yml exec pgvector \
psql -U engramia -c "SELECT version();"
Deep health returns migration: error¶
The running container reports a different migration revision than the code expects. This usually means migrations were not applied after the last deploy.
Deep health returns stripe: error¶
Stripe API is unreachable or the secret key is invalid.
# Verify STRIPE_SECRET_KEY is set and starts with sk_live_ or sk_test_
grep STRIPE_SECRET_KEY .env
# Test reachability from the VM
curl -sf https://api.stripe.com/v1/
API returns 503 for all endpoints¶
If ENGRAMIA_MAINTENANCE=true is set in .env, all endpoints except
/v1/health and /v1/health/deep return 503. Remove the variable and restart:
Rate limit errors (429) under load¶
The default rate limit is 60 req/min per IP for standard endpoints and 10 req/min for LLM-intensive endpoints. Adjust via:
Note: Rate limiting is per-process and in-memory. For multi-instance deployments, configure an external rate limiter upstream (API gateway, WAF).