Compare commits
2 Commits
701a9b490d
...
main
| Author | SHA1 | Date |
|---|---|---|
| | 35c8657f13 | |
| | c6436d720e | |
107
backend/postgres/guides/WAL_RECOVERY.md
Normal file
@@ -0,0 +1,107 @@
# PostgreSQL WAL Corruption Recovery Guide

## Symptoms

PostgreSQL container crashes on startup with logs showing:

```
LOG: unexpected pageaddr X/Y in WAL segment ...
LOG: invalid checkpoint record
PANIC: could not locate a valid checkpoint record at ...
LOG: startup process (PID N) was terminated by signal 6: Aborted
LOG: aborting startup due to startup process failure
```

The container restarts repeatedly, each time hitting the same error.
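
To check that a crash loop matches these symptoms rather than some other startup failure, the log output can be scanned for the checkpoint messages quoted above. This is a minimal sketch: the function name `has_wal_corruption` is hypothetical, and it only recognizes the two checkpoint messages listed in this guide.

```bash
#!/usr/bin/env bash
# Hypothetical helper: exits 0 if the log text on stdin contains one of the
# checkpoint failure messages quoted in the Symptoms section, non-zero otherwise.
has_wal_corruption() {
  grep -Eq 'invalid checkpoint record|could not locate a valid checkpoint record'
}
```

It could be used as `docker compose logs --no-color --tail=200 | has_wal_corruption && echo "WAL corruption suspected"`. A match only suggests WAL corruption; it does not rule out damage to the data files.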

## Cause

The Write-Ahead Log (WAL) was corrupted by an unclean shutdown (power loss, host crash, force kill, etc.). PostgreSQL cannot find a valid checkpoint to resume from.

## What You Risk Losing

- **Committed data** covered by the last completed checkpoint: Safe. It is already written to the data files.
- **Uncommitted transactions** from the moment of the crash: Lost. Their WAL records are discarded.
- **Recent changes** that were committed but not yet checkpointed: Not guaranteed. Anything not yet flushed to the data files existed only in the WAL that `pg_resetwal` discards, so these changes may be lost or only partially applied, leaving the database inconsistent.

## Recovery Procedure

### 1. Stop the Crashing Container

```bash
cd /path/to/postgres/service
docker compose down
```

### 2. Run `pg_resetwal`

`pg_resetwal` clears the WAL and resets control information in `pg_control` so that the server can start again. The PostgreSQL documentation describes it as a last resort: the database may contain inconsistent data afterwards, so dump the data, re-initialize the cluster, and restore once the server is running.

**If your data is in a named Docker volume (e.g., `pgdata`):**

```bash
docker run --rm \
  -v pgdata:/var/lib/postgresql \
  --user postgres \
  postgres:18 \
  pg_resetwal -f /var/lib/postgresql/18/docker
```

> Adjust the path `/var/lib/postgresql/18/docker` to match your `PGDATA` setting. Compose usually prefixes volume names with the project name, so confirm the real volume name with `docker volume ls` before running the command.

**If your data is in a bind mount (e.g., `./data`):**

```bash
docker run --rm \
  -v "$(pwd)/data:/var/lib/postgresql/data" \
  --user postgres \
  postgres:18 \
  pg_resetwal -f /var/lib/postgresql/data
```
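
Before forcing the reset, it can help to preview what `pg_resetwal` would change. Its `--dry-run` (`-n`) option prints the current and planned control values and exits without modifying anything. The sketch below wraps the named-volume command from above; the function name `resetwal_preview` and its default arguments are assumptions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: show what pg_resetwal would do, without changing anything.
#   $1 = named volume (assumed "pgdata")
#   $2 = PGDATA path inside the container (assumed the postgres:18 default)
resetwal_preview() {
  local volume="${1:-pgdata}"
  local pgdata="${2:-/var/lib/postgresql/18/docker}"
  docker run --rm \
    -v "${volume}:/var/lib/postgresql" \
    --user postgres \
    postgres:18 \
    pg_resetwal --dry-run "${pgdata}"
}
```

Calling `resetwal_preview pgdata /var/lib/postgresql/18/docker` prints the values only. If the output looks plausible, run the `-f` command shown above.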

### 3. Start the Database

```bash
docker compose up -d
```

### 4. Verify

```bash
docker compose logs --tail=20
docker inspect --format='{{.State.Health.Status}}' postgres
```

You should see:

```
LOG: database system is ready to accept connections
```

The health status should be `healthy`. This assumes the container is named `postgres` and has a `healthcheck` defined.
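
The health status can take a few seconds to settle after startup, so a single `docker inspect` call may still report `starting`. The following sketch polls until the status is `healthy` or a timeout expires. The function name `wait_for_healthy`, the 2-second poll interval, and the 60-second default timeout are assumptions, not part of the original procedure.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a container's health status until it is "healthy"
# or the timeout expires.
#   $1 = container name (assumed "postgres")
#   $2 = timeout in seconds (default 60)
wait_for_healthy() {
  local container="${1:-postgres}"
  local timeout="${2:-60}"
  local waited=0 status
  while [ "$waited" -lt "$timeout" ]; do
    # Inspect errors (e.g. container not created yet) count as "not healthy yet".
    status="$(docker inspect --format='{{.State.Health.Status}}' "$container" 2>/dev/null || true)"
    if [ "$status" = "healthy" ]; then
      return 0
    fi
    sleep 2
    waited=$((waited + 2))
  done
  echo "Timed out after ${timeout}s; last status: ${status:-unknown}" >&2
  return 1
}
```

For example, `wait_for_healthy postgres 120 && echo "database is up"` could follow `docker compose up -d`.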

## Prevention

- Ensure graceful shutdowns: use `docker compose down` instead of `docker kill`
- Use a UPS if running on bare metal to avoid power-loss crashes
- Keep backups of important data volumes
- Consider a bounded restart policy such as `restart: on-failure:5` instead of `always` to cap crash loops (`unless-stopped` still restarts a crashing container indefinitely)

## When NOT to Use This Fix

Do **not** use `pg_resetwal` if:
- You have a recent base backup and WAL archive; restore from backup instead
- You suspect data file corruption (not just WAL corruption)
- You can recover by other means (e.g., starting from a replication standby)

If unsure, copy the data directory somewhere safe before running `pg_resetwal`.
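
For the bind-mount layout, that safety copy can be a timestamped archive. The sketch below is one way to take it, with the container stopped. The function name `backup_data_dir` and the archive naming scheme are assumptions, and a named volume would instead need to be archived from inside a container that mounts it.

```bash
#!/usr/bin/env bash
# Hypothetical helper: archive a bind-mounted data directory before running
# pg_resetwal. Prints the path of the archive it created.
#   $1 = data directory (e.g. ./data)
#   $2 = directory to store the archive in
backup_data_dir() {
  local src="$1" dest_dir="$2"
  local archive
  archive="${dest_dir}/pgdata-$(date +%Y%m%d-%H%M%S).tar.gz"
  mkdir -p "$dest_dir" || return 1
  # -C keeps paths in the archive relative to the data directory's parent.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" || return 1
  echo "$archive"
}
```

A call such as `backup_data_dir ./data ./pg-backups` would be run before step 2. The data directory is normally readable only by the container's `postgres` user, so the command may need `sudo`.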

## One-Liner for Future Emergencies

If you're sure it's WAL corruption and you know your setup:

```bash
docker compose down && \
docker run --rm -v pgdata:/var/lib/postgresql --user postgres postgres:18 pg_resetwal -f /var/lib/postgresql/18/docker && \
docker compose up -d
```
@@ -3,6 +3,8 @@ SUBDOMAIN=
 # TRAEFIK_USER=
 SSL_EMAIL_FILE=/run/secrets/CF_API_EMAIL
 CF_API_EMAIL_FILE=/run/secrets/CF_API_EMAIL
-CF_API_KEY_FILE=/run/secrets/CF_API_KEY
+ACME_EMAIL=
+CF_DNS_API_TOKEN_FILE=/run/secrets/CF_DNS_API_TOKEN
 SSH_PORT=
 TZ=
+TAG=
@@ -52,8 +52,8 @@ services:
     env_file:
       - .env
     secrets:
-      - CF_API_KEY
       - CF_API_EMAIL
+      - CF_DNS_API_TOKEN
     volumes:
       - ./traefik_data:/letsencrypt
       - /var/run/docker.sock:/var/run/docker.sock:ro
@@ -67,10 +67,10 @@ services:
       - jump
       - mcp
 secrets:
-  CF_API_KEY:
-    file: .secrets/CF_API_KEY
   CF_API_EMAIL:
     file: .secrets/CF_API_EMAIL
+  CF_DNS_API_TOKEN:
+    file: .secrets/CF_DNS_API_TOKEN
 networks:
   frontend:
     external: