Add recovery.yaml and recovery info
This is a post-mortem for this:
In response to a bug report, I attempted to push an update to Tribes, which resulted in the Postgres database completely failing to start:
$ kubectl logs -n tribeshost pod/tribeshost-deployment-579cf6bcb7-bwfcj postgres
PostgreSQL Database directory appears to contain a database; Skipping initialization
2021-03-27 21:59:07.544 UTC [1] LOG: starting PostgreSQL 13.1 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 64-bit
2021-03-27 21:59:07.545 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-03-27 21:59:07.545 UTC [1] LOG: listening on IPv6 address "::", port 5432
2021-03-27 21:59:07.570 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-03-27 21:59:07.604 UTC [21] LOG: database system was shut down at 2021-03-27 19:39:32 UTC
2021-03-27 21:59:07.605 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2021-03-27 21:59:07.605 UTC [21] PANIC: could not locate a valid checkpoint record
2021-03-27 21:59:08.101 UTC [22] FATAL: the database system is starting up
2021-03-27 21:59:08.102 UTC [23] FATAL: the database system is starting up
2021-03-27 21:59:08.569 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2021-03-27 21:59:08.569 UTC [1] LOG: aborting startup due to startup process failure
2021-03-27 21:59:08.574 UTC [1] LOG: database system is shut down
The error, PANIC: could not locate a valid checkpoint record, means the database is corrupted: on startup, Postgres reads the location of the last checkpoint from pg_control and tries to read that checkpoint record back from the WAL, and here the record it found was garbage (hence the "invalid resource manager ID" line just before the PANIC), so it could not recover to a consistent state.
To make matters worse, I couldn't even run commands in the container, because the container itself kept dying right after starting:
$ kubectl get pod -n tribeshost
NAME READY STATUS RESTARTS AGE
tribeshost-deployment-579cf6bcb7-bwfcj 1/2 CrashLoopBackOff 26 113m
$ kubectl exec -it -n tribeshost pod/tribeshost-deployment-579cf6bcb7-bwfcj -c postgres -- /bin/sh
error: unable to upgrade connection: container not found ("postgres")
By this point I was exhausted from other things and had IRL obligations, so I had to leave this alone for two days, stressed the whole time because I had no idea what to do.
Fortunately, I spent that time formulating a plan in my head, and it worked quickly once I executed it.
I created a recovery.yaml deployment manifest and copied the tribeshost-deployment into it. I then modified the postgres container to sleep for 24 hours instead of running the usual command. This prevented it from exiting immediately.
I deleted the tribeshost-deployment and deployed this recovery deployment instead:
$ kubectl delete deployment -n tribeshost tribeshost-deployment
deployment.apps "tribeshost-deployment" deleted
$ kubectl apply -f installation/k8s/recovery.yaml
deployment.apps/tribeshost-recovery created
(See recovery.yaml in this MR for that deployment.)
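For context, the important part of that manifest looks roughly like this. This is an illustrative excerpt, not the full file: the container name, namespace, and PVC name match the real deployment above, but the image tag, labels, and mount path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tribeshost-recovery
  namespace: tribeshost
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tribeshost-recovery
  template:
    metadata:
      labels:
        app: tribeshost-recovery
    spec:
      containers:
        - name: postgres
          image: postgres:13.1-alpine        # assumed tag, based on the startup log above
          # The one change that matters: sleep instead of starting postgres,
          # so the container stays alive long enough to shell into.
          command: ["sleep", "86400"]
          volumeMounts:
            - name: db
              mountPath: /var/lib/postgresql/data   # assumed mount path
      volumes:
        - name: db
          persistentVolumeClaim:
            claimName: tribeshost-db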
I was then able to shell into that container and run the command I needed to fix the database corruption:
$ kubectl exec -it -n tribeshost tribeshost-recovery-c6489487d-zv9fl -c postgres -- /bin/sh
/ # su postgres
/ $ pg_resetwal $DATADIR
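pg_resetwal throws away the broken WAL (so any transactions that were only recorded there are lost), but it lets the cluster start again. If you want to sanity-check the result before tearing the recovery deployment down, pg_controldata on the same data directory should now report a clean ("shut down") cluster state and a fresh checkpoint location. I didn't capture that step, so this is only a suggestion:
/ $ pg_controldata $DATADIR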
Afterwards, I deleted the recovery deployment and deployed tribeshost.yaml:
$ kubectl delete deployment -n tribeshost tribeshost-recovery
deployment.apps "tribeshost-recovery" deleted
$ kubectl apply -f installation/k8s/tribeshost.yaml
namespace/tribeshost unchanged
storageclass.storage.k8s.io/microk8s-hostpath unchanged
persistentvolumeclaim/tribeshost-db unchanged
deployment.apps/tribeshost-deployment created
service/tribeshost-service unchanged
httpproxy.projectcontour.io/tribeshost-httpproxy unchanged
certificate.cert-manager.io/tribeshost-certificate unchanged
Fortunately, it deployed fine, and everything started working again after that.
But how can we prevent this in the future?
I have a hunch there is some weirdness going on with our storage class, Longhorn. I'm thinking of maybe switching over to the Crunchy PostgreSQL Operator (I've been saying this for a while, but it intimidates me).