Backup and Restore Playbook¶
Engramia v0.6.0 · RTO: 4 hours · RPO: 24 hours
Overview¶
| Storage backend | Backup method | Frequency | Off-site |
|---|---|---|---|
| PostgreSQL (prod) | pg_dump custom format |
Daily | Hetzner Object Storage |
| JSON storage (dev) | Directory copy | Manual | N/A |
PostgreSQL Backup¶
Manual dump¶
# On the production host
cd /opt/engramia
# Dump to local file (custom format — smaller, supports parallel restore)
docker compose -f docker-compose.prod.yml exec -T pgvector \
pg_dump -U engramia engramia -Fc \
> /opt/engramia/backups/engramia_$(date +%Y%m%d_%H%M).dump
# Verify the dump is not empty
ls -lh /opt/engramia/backups/
Automated daily backup (cron)¶
Add to /etc/cron.d/engramia-backup on the production VM:
# Daily at 03:00 — backup PostgreSQL + upload to Hetzner Object Storage
0 3 * * * root /opt/engramia/scripts/backup.sh >> /var/log/engramia-backup.log 2>&1
/opt/engramia/scripts/backup.sh:
#!/bin/bash
set -euo pipefail
BACKUP_DIR=/opt/engramia/backups
BUCKET=s3://engramia-backups
DATE=$(date +%Y%m%d_%H%M)
FILE="$BACKUP_DIR/engramia_${DATE}.dump"
KEEP_DAYS=30
# Create backup
docker compose -f /opt/engramia/docker-compose.prod.yml exec -T pgvector \
pg_dump -U engramia engramia -Fc > "$FILE"
# Upload to Hetzner Object Storage (configure aws CLI with Hetzner S3 endpoint)
aws --endpoint-url https://fsn1.your-objectstorage.com s3 cp "$FILE" "$BUCKET/"
# Prune local backups older than 7 days
find "$BACKUP_DIR" -name "*.dump" -mtime +7 -delete
# Prune remote backups older than KEEP_DAYS
aws --endpoint-url https://fsn1.your-objectstorage.com s3 ls "$BUCKET/" \
| awk '{print $4}' \
| while read -r key; do
age=$(( ($(date +%s) - $(date -d "$(echo "$key" | grep -oP '\d{8}_\d{4}' | sed 's/_/ /')" +%s)) / 86400 ))
[ "$age" -gt "$KEEP_DAYS" ] && \
aws --endpoint-url https://fsn1.your-objectstorage.com s3 rm "$BUCKET/$key"
done
echo "[$DATE] Backup complete: $FILE"
Verify backup integrity (weekly)¶
# Test restore into a temporary container — does not affect production
docker run --rm \
-e POSTGRES_USER=engramia \
-e POSTGRES_PASSWORD=test \
-e POSTGRES_DB=engramia \
pgvector/pgvector:0.7.4-pg16 &
sleep 5
pg_restore --list /opt/engramia/backups/engramia_latest.dump | wc -l
# Should print > 0 (number of archived objects)
PostgreSQL Restore¶
Full restore (disaster recovery)¶
# 1. Enter maintenance mode
ssh root@engramia-staging
cd /opt/engramia
echo "ENGRAMIA_MAINTENANCE=true" >> .env
docker compose -f docker-compose.prod.yml up -d engramia-api
# 2. Stop the API to prevent writes during restore
docker compose -f docker-compose.prod.yml stop engramia-api
# 3. Drop and recreate the database
docker compose -f docker-compose.prod.yml exec pgvector \
psql -U engramia -c "DROP DATABASE IF EXISTS engramia;"
docker compose -f docker-compose.prod.yml exec pgvector \
psql -U engramia -c "CREATE DATABASE engramia;"
# 4. Restore from dump
cat /opt/engramia/backups/engramia_YYYYMMDD_HHMM.dump \
| docker compose -f docker-compose.prod.yml exec -T pgvector \
pg_restore -U engramia -d engramia -v
# 5. Run migrations to ensure schema is at head
docker compose -f docker-compose.prod.yml run --rm engramia-api \
alembic upgrade head
# 6. Start API and exit maintenance mode
docker compose -f docker-compose.prod.yml up -d engramia-api
sed -i '/ENGRAMIA_MAINTENANCE/d' .env
docker compose -f docker-compose.prod.yml up -d
Point-in-time (migration rollback only)¶
See docs/deployment.md § Rollback strategy for Alembic downgrade procedure.
JSON Storage Backup (development)¶
# Backup
cp -r ./engramia_data ./engramia_data_backup_$(date +%Y%m%d)
# Or portable export (works across storage backends)
engramia export --path ./engramia_data --output backup.jsonl
# Restore
cp -r ./engramia_data_backup_20260101 ./engramia_data
# or
engramia import --path ./engramia_data --input backup.jsonl
RTO / RPO Targets¶
| Scenario | RPO | RTO |
|---|---|---|
| VM failure (no data loss) | 24h (last daily backup) | 2h (new VM + restore) |
| Database corruption | 24h (last daily backup) | 2h (pg_restore) |
| Accidental data deletion (single tenant) | 24h | 30min (pg_restore to temp DB, selective re-import) |
| Complete rebuild from scratch | N/A | 4h (VM + Caddy + Docker + migrate + restore) |
RPO can be improved to 1h by enabling PostgreSQL WAL archiving to object storage (requires wal_level=replica and a WAL archive destination).
Pre-Migration Backup (mandatory)¶
Always take a dump before running alembic upgrade:
docker compose -f docker-compose.prod.yml exec -T pgvector \
pg_dump -U engramia engramia -Fc \
> /opt/engramia/backups/pre_migration_$(date +%Y%m%d_%H%M).dump
The CI/CD deploy step in docker.yml runs migrations automatically. If you're deploying manually, run the backup first.
Automated Restore Testing (A12)¶
Restore testing is a launch blocker (GTM A12). A GitHub Actions workflow runs the full cycle automatically so regressions are caught before production incidents.
What the test does¶
scripts/test_backup_restore.py runs end-to-end without any external
dependencies beyond Docker and Python:
- Starts a source
pgvector/pgvector:0.7.4-pg16container on port 15441. - Applies the full Alembic migration chain (
alembic upgrade head). - Inserts test seed rows (tenant, project, two
memory_datakeys). - Runs
pg_dump→ gzip-compressed.sql.gzbackup. - Starts a target pgvector container on port 15442.
- Restores the backup with
psql. - Validates:
- pgvector extension is active.
- All 14 expected tables are present (
memory_data,tenants,jobs, …cloud_users). alembic_versionis at revision013(head).- Default tenant seeded by migration 003 is present.
- Seed tenant
brt-tenant-1and both seedmemory_datakeys are present. - Tears down both containers and deletes the temp backup file.
- Exits
0(PASS) or1(FAIL).
CI schedule¶
The workflow (.github/workflows/backup-restore-test.yml) runs:
- Weekly — every Sunday at 04:00 UTC.
- On demand — via the GitHub Actions UI (
workflow_dispatch).
Artifacts: the full run log is uploaded as brt-log-<run_id> and kept for 30
days. A Markdown job summary is posted directly to the Actions run page so
triage is possible without downloading the artifact.
Run locally¶
Prerequisites: Docker daemon running, Python 3.12, Postgres extras installed.
Expected output (abridged):
[HH:MM:SS] ============================================================
[HH:MM:SS] Engramia Backup Restore Test run=a1b2c3d4
...
[HH:MM:SS] [Phase 4/4] Running validation checks
[HH:MM:SS]
[HH:MM:SS] RESULT: PASS (42.3s)
[HH:MM:SS] ✓ pgvector extension active
[HH:MM:SS] ✓ all 14 expected tables present
[HH:MM:SS] ✓ alembic_version = 013
[HH:MM:SS] ✓ default tenant present
[HH:MM:SS] ✓ seed tenant 'brt-tenant-1' present
[HH:MM:SS] ✓ 2 seed memory_data rows present
Trigger via GitHub Actions UI¶
- Go to Actions → Backup Restore Test in the repository.
- Click Run workflow.
- Optionally fill in pgvector Docker image override (leave blank for
the default prod image
pgvector/pgvector:0.7.4-pg16). - Click Run workflow.
On failure¶
- Open the failing run; the Markdown summary shows the last 100 lines of output.
- Download artifact
brt-log-<run_id>for the full log. - Reproduce locally:
python scripts/test_backup_restore.py. - Common failure modes:
- Missing table — a new migration was added but
EXPECTED_TABLESin the script was not updated. Add the table name to the set. - Alembic revision mismatch — a new migration was merged; update
EXPECTED_ALEMBIC_REVISIONin the script. - psql restore failed — the dump format or encoding changed; check
pg_dump options in
create_backup(). - Container port conflict — set
ENGRAMIA_BRT_SOURCE_PORT/ENGRAMIA_BRT_TARGET_PORTto unused ports.
Environment overrides¶
| Variable | Default | Description |
|---|---|---|
ENGRAMIA_BRT_PGIMAGE |
pgvector/pgvector:0.7.4-pg16 |
Docker image |
ENGRAMIA_BRT_SOURCE_PORT |
15441 |
Host port for source container |
ENGRAMIA_BRT_TARGET_PORT |
15442 |
Host port for target container |
Off-Site Storage Configuration¶
Hetzner Object Storage (FSN1 region, S3-compatible):
# Configure aws CLI for Hetzner
aws configure set aws_access_key_id YOUR_ACCESS_KEY
aws configure set aws_secret_access_key YOUR_SECRET_KEY
aws configure set default.region eu-central-1
# Test access
aws --endpoint-url https://fsn1.your-objectstorage.com s3 ls s3://engramia-backups/
Create the bucket in the Hetzner Cloud Console → Object Storage → Create Bucket. Set lifecycle rule: delete objects older than 30 days.