diff --git a/backend/postgres/guides/WAL_RECOVERY.md b/backend/postgres/guides/WAL_RECOVERY.md
new file mode 100644
index 0000000..ee5cf8d
--- /dev/null
+++ b/backend/postgres/guides/WAL_RECOVERY.md
@@ -0,0 +1,107 @@
+# PostgreSQL WAL Corruption Recovery Guide
+
+## Symptoms
+
+The PostgreSQL container crashes on startup, and its logs show:
+
+```
+LOG: unexpected pageaddr X/Y in WAL segment ...
+LOG: invalid checkpoint record
+PANIC: could not locate a valid checkpoint record at ...
+LOG: startup process (PID N) was terminated by signal 6: Aborted
+LOG: aborting startup due to startup process failure
+```
+
+The container restarts repeatedly, each time hitting the same error.
+
+## Cause
+
+The Write-Ahead Log (WAL) was damaged, typically during an unclean shutdown (power loss, host crash, force kill) on storage that did not flush writes durably. PostgreSQL cannot find a valid checkpoint record to start recovery from.
+
+## What You Risk Losing
+
+- **Data committed before the last completed checkpoint**: Normally safe. It has already been written to the data files.
+- **Uncommitted transactions** in flight at the moment of the crash: Lost.
+- **Recent changes** that were committed but not yet checkpointed: At risk. They may exist only in the WAL that `pg_resetwal` discards, so they can be lost or left partially applied.
+
+## Recovery Procedure
+
+### 1. Stop the Crashing Container
+
+```bash
+cd /path/to/postgres/service
+docker compose down
+```
+
+### 2. Run `pg_resetwal`
+
+This discards the existing WAL and writes a new checkpoint record so the server can start. It does not repair data, so treat the result as potentially inconsistent.
+
+**If your data is in a named Docker volume (e.g., `pgdata`):**
+
+```bash
+docker run --rm \
+  -v pgdata:/var/lib/postgresql \
+  --user postgres \
+  postgres:18 \
+  pg_resetwal -f /var/lib/postgresql/18/docker
+```
+
+> Adjust the path `/var/lib/postgresql/18/docker` to match your `PGDATA` setting.
+
+**If your data is in a bind mount (e.g., `./data`):**
+
+```bash
+docker run --rm \
+  -v "$(pwd)/data:/var/lib/postgresql/data" \
+  --user postgres \
+  postgres:18 \
+  pg_resetwal -f /var/lib/postgresql/data
+```
+
+### 3. Start the Database
+
+```bash
+docker compose up -d
+```
+
+### 4. Verify
+
+```bash
+docker compose logs --tail=20
+docker inspect --format='{{.State.Health.Status}}' postgres
+```
+
+You should see:
+
+```
+LOG: database system is ready to accept connections
+```
+
+And the health status should be `healthy`. Because `pg_resetwal` can leave inconsistent data behind, the PostgreSQL documentation recommends dumping the database, running `initdb`, and restoring as soon as the server is up.
+
+## Prevention
+
+- Shut down gracefully: use `docker compose down` rather than `docker kill`
+- Use a UPS if running on bare metal to avoid power-loss crashes
+- Keep backups of important data volumes
+- Consider `restart: on-failure:5` instead of `always` to cap rapid crash loops (`unless-stopped` still restarts after every crash)
+
+## When NOT to Use This Fix
+
+Do **not** use `pg_resetwal` if:
+- You have a recent base backup and WAL archive: restore from backup instead
+- You suspect data file corruption (not just WAL corruption)
+- You can recover by other means (e.g., promoting a replication standby)
+
+If unsure, copy the data directory somewhere safe before running `pg_resetwal`; the reset cannot be undone.
+
+## One-Liner for Future Emergencies
+
+If you're sure it's WAL corruption and you know your setup:
+
+```bash
+docker compose down && \
+docker run --rm -v pgdata:/var/lib/postgresql --user postgres postgres:18 pg_resetwal -f /var/lib/postgresql/18/docker && \
+docker compose up -d
+```