Add recovery.yaml and recovery info
This is a post-mortem for this:
In response to a bug report, I attempted to push an update to Tribes, which resulted in the Postgres database completely failing to start:
$ kubectl logs -n tribeshost pod/tribeshost-deployment-579cf6bcb7-bwfcj postgres
PostgreSQL Database directory appears to contain a database; Skipping initialization
2021-03-27 21:59:07.544 UTC [1] LOG: starting PostgreSQL 13.1 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 64-bit
2021-03-27 21:59:07.545 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-03-27 21:59:07.545 UTC [1] LOG: listening on IPv6 address "::", port 5432
2021-03-27 21:59:07.570 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2021-03-27 21:59:07.604 UTC [21] LOG: database system was shut down at 2021-03-27 19:39:32 UTC
2021-03-27 21:59:07.605 UTC [21] LOG: invalid resource manager ID in primary checkpoint record
2021-03-27 21:59:07.605 UTC [21] PANIC: could not locate a valid checkpoint record
2021-03-27 21:59:08.101 UTC [22] FATAL: the database system is starting up
2021-03-27 21:59:08.102 UTC [23] FATAL: the database system is starting up
2021-03-27 21:59:08.569 UTC [1] LOG: startup process (PID 21) was terminated by signal 6: Aborted
2021-03-27 21:59:08.569 UTC [1] LOG: aborting startup due to startup process failure
2021-03-27 21:59:08.574 UTC [1] LOG: database system is shut down
The error, PANIC: could not locate a valid checkpoint record, means the database is corrupted: on startup, Postgres reads the location of the last checkpoint from pg_control and tries to read that checkpoint record back from the WAL, and here the record it found was garbage (hence the "invalid resource manager ID" line just before the PANIC), so it could not recover to a consistent state.
To make matters worse, I couldn't even run commands in the container, because the container itself kept dying right after starting:
$ kubectl get pod -n tribeshost
NAME READY STATUS RESTARTS AGE
tribeshost-deployment-579cf6bcb7-bwfcj 1/2 CrashLoopBackOff 26 113m
$ kubectl exec -it -n tribeshost pod/tribeshost-deployment-579cf6bcb7-bwfcj -c postgres -- /bin/sh
error: unable to upgrade connection: container not found ("postgres")
By this point I was exhausted from other things and had IRL obligations, so I had to leave this alone for two days, stressed the whole time because I had no idea what to do.
Fortunately, I spent that time formulating a plan in my head, and it worked quickly once I executed it.
I created a recovery.yaml deployment manifest and copied the tribeshost-deployment into it. I then modified the postgres container to sleep for 24 hours instead of running the usual command. This prevented it from exiting immediately.
I deleted the tribeshost-deployment and deployed this recovery deployment instead:
$ kubectl delete deployment -n tribeshost tribeshost-deployment
deployment.apps "tribeshost-deployment" deleted
$ kubectl apply -f installation/k8s/recovery.yaml
deployment.apps/tribeshost-recovery created
(See recovery.yaml in this MR for that deployment.)
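For context, the important part of that manifest looks roughly like this. This is an illustrative excerpt, not the full file: the container name, namespace, and PVC name match the real deployment above, but the image tag, labels, and mount path are assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tribeshost-recovery
  namespace: tribeshost
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tribeshost-recovery
  template:
    metadata:
      labels:
        app: tribeshost-recovery
    spec:
      containers:
        - name: postgres
          image: postgres:13.1-alpine        # assumed tag, based on the startup log above
          # The one change that matters: sleep instead of starting postgres,
          # so the container stays alive long enough to shell into.
          command: ["sleep", "86400"]
          volumeMounts:
            - name: db
              mountPath: /var/lib/postgresql/data   # assumed mount path
      volumes:
        - name: db
          persistentVolumeClaim:
            claimName: tribeshost-db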
I was then able to shell into that container and run the command I needed to fix the database corruption:
$ kubectl exec -it -n tribeshost tribeshost-recovery-c6489487d-zv9fl -c postgres -- /bin/sh
/ # su postgres
/ $ pg_resetwal $DATADIR
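pg_resetwal throws away the broken WAL (so any transactions that were only recorded there are lost), but it lets the cluster start again. If you want to sanity-check the result before tearing the recovery deployment down, pg_controldata on the same data directory should now report a clean ("shut down") cluster state and a fresh checkpoint location. I didn't capture that step, so this is only a suggestion:
/ $ pg_controldata $DATADIR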
Afterwards, I deleted the recovery deployment and deployed tribeshost.yaml:
$ kubectl delete deployment -n tribeshost tribeshost-recovery
deployment.apps "tribeshost-recovery" deleted
$ kubectl apply -f installation/k8s/tribeshost.yaml
namespace/tribeshost unchanged
storageclass.storage.k8s.io/microk8s-hostpath unchanged
persistentvolumeclaim/tribeshost-db unchanged
deployment.apps/tribeshost-deployment created
service/tribeshost-service unchanged
httpproxy.projectcontour.io/tribeshost-httpproxy unchanged
certificate.cert-manager.io/tribeshost-certificate unchanged
Fortunately, it deployed fine, and everything started working again after that.
But how can we prevent this in the future?
I have a hunch there is some weirdness going on with our storage class, Longhorn. I'm thinking of maybe switching over to the Crunchy PostgreSQL Operator (I've been saying this for a while, but it intimidates me).