Patroni failovers can still fail with error 'could not locate a valid checkpoint record'
Summary
After switching our performance test environments to Patroni and PG12, we've been seeing a sporadic issue with Patroni on them.
Much like the previously reported issue #5746 (closed), the problem is around failover. Specifically, a node that was previously a primary is unable to rejoin the cluster as a secondary due to what appears to be a corrupted transaction log, with the error `PANIC: could not locate a valid checkpoint record`.
In the previous issue, #5746 (closed), we saw similar failures on our environments. At the time it was discovered that `pg_rewind` wasn't enabled by default, and enabling it was the fix (along with additional optional flags that allow Patroni to perform corrective actions on certain failures). This issue still occurs with all of the options described above enabled.
One potential reason why this is happening is that we switch our environments off daily after running our tests. We do this with a gcloud command, `gcloud compute instances stop`, which, according to the docs, performs a clean shutdown, much like invoking the shutdown functionality of a workstation or laptop. My expectation was that this should be fine (and it has been for every other component in GitLab so far), as a clean shutdown should in turn stop the GitLab services. This is only an assumption on my part, though. If it isn't working like this, could the machine shutdowns be corrupting the transaction log?
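One way to test the clean-shutdown assumption is to inspect the control file after the machine comes back up but before Postgres is started. The following is a minimal sketch, not part of the original investigation: it parses the text printed by `pg_controldata` and reports whether the cluster recorded an orderly shutdown. The function names are hypothetical, and it assumes the output has already been captured as a string.

```python
import re
from typing import Optional

# States that pg_controldata reports after an orderly stop.
CLEAN_STATES = {"shut down", "shut down in recovery"}


def cluster_state(controldata_output: str) -> Optional[str]:
    """Return the value of the 'Database cluster state' line, or None if absent."""
    match = re.search(
        r"^Database cluster state:[ \t]*(.+?)[ \t]*$",
        controldata_output,
        re.MULTILINE,
    )
    return match.group(1) if match else None


def was_cleanly_shut_down(controldata_output: str) -> bool:
    """True when the control file records an orderly shutdown."""
    return cluster_state(controldata_output) in CLEAN_STATES
```

The input would presumably come from running `pg_controldata -D /var/opt/gitlab/postgresql/data` with the binary from the configured `bin_dir`. Note that the failing node in this report logs "database system was shut down in recovery", so a clean state in the control file does not by itself rule out a damaged checkpoint record in the WAL.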
Another potential reason is that we upgrade the boxes each day with the latest omnibus nightly; it's possible that the reconfigure is also corrupting the log.
Steps to reproduce
Currently we're seeing the issue happen on at least one of our 5 applicable test environments weekly. To reproduce it, an HA Postgres cluster would be needed, potentially along with multiple shutdowns and upgrades to the latest nightly package.
What is the current bug behavior?
Former primary Postgres nodes can fail after switching to a secondary with the error `PANIC: could not locate a valid checkpoint record`.
What is the expected correct behavior?
Postgres can fail over to a new primary without issue, or at least recover with the optional corrective actions enabled.
Relevant logs
When the node fails it will enter a reboot loop with the following error being posted constantly:
2021-02-01_11:26:23.41899 2021-02-01 11:26:23,417 INFO: Lock owner: gitlab-qa-50k-postgres-3.c.gitlab-qa-50k-193234.internal; I am gitlab-qa-50k-postgres-2.c.gitlab-qa-50k-193234.internal
2021-02-01_11:26:23.42095 2021-02-01 11:26:23,420 INFO: starting as a secondary
2021-02-01_11:26:23.61027 2021-02-01 11:26:23,609 INFO: postmaster pid=3496
2021-02-01_11:26:23.61148 LOG: starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2021-02-01_11:26:23.61151 LOG: listening on IPv4 address "0.0.0.0", port 5432
2021-02-01_11:26:23.61369 /var/opt/gitlab/postgresql:5432 - no response
2021-02-01_11:26:23.61617 LOG: listening on Unix socket "/var/opt/gitlab/postgresql/.s.PGSQL.5432"
2021-02-01_11:26:24.13034 LOG: database system was shut down in recovery at 2021-02-01 01:36:23 GMT
2021-02-01_11:26:24.13050 LOG: entering standby mode
2021-02-01_11:26:24.13062 LOG: invalid resource manager ID in primary checkpoint record
2021-02-01_11:26:24.13063 PANIC: could not locate a valid checkpoint record
2021-02-01_11:26:24.26456 LOG: startup process (PID 3498) was terminated by signal 6: Aborted
2021-02-01_11:26:24.26459 LOG: aborting startup due to startup process failure
2021-02-01_11:26:24.33607 LOG: database system is shut down
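Because the failure is sporadic across several environments, it may help to scan the logs for this signature automatically. The sketch below is only an illustration under assumptions: it expects one log entry per line in the timestamp-prefixed format shown above, and the function name `find_checkpoint_panics` is hypothetical. For each checkpoint PANIC it reports the Postgres message logged immediately before it, which in the excerpt above is the more specific "invalid resource manager ID" line.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Matches Postgres server lines such as "<timestamp> LOG: message".
# Patroni's own INFO lines carry a second timestamp and are skipped.
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>LOG|WARNING|ERROR|FATAL|PANIC):\s+(?P<msg>.*)$"
)
SIGNATURE = "could not locate a valid checkpoint record"


def find_checkpoint_panics(
    lines: Iterable[str],
) -> Iterator[Tuple[str, Optional[str]]]:
    """Yield (timestamp, previous Postgres message) for each checkpoint PANIC."""
    previous: Optional[str] = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # not a Postgres server log line
        if match.group("level") == "PANIC" and SIGNATURE in match.group("msg"):
            yield match.group("ts"), previous
        previous = match.group("msg")
```

Any iterable of lines works as input, for example an open file handle on the node's Patroni log.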
Configuration details
`patroni.yaml` file of node that failed
name: gitlab-qa-50k-postgres-2.c.gitlab-qa-50k-193234.internal
scope: postgresql-ha
log:
  level: INFO
consul:
  url: http://127.0.0.1:8500
  service_check_interval: 10s
  register_service: true
  checks: []
postgresql:
  bin_dir: /opt/gitlab/embedded/bin
  data_dir: /var/opt/gitlab/postgresql/data
  config_dir: /var/opt/gitlab/postgresql/data
  listen: 0.0.0.0:5432
  connect_address: 10.142.0.43:5432
  use_unix_socket: true
  parameters:
    unix_socket_directories: /var/opt/gitlab/postgresql
  authentication:
    superuser:
      username: gitlab-psql
    replication:
      username: gitlab_replicator
  remove_data_directory_on_rewind_failure: true
  remove_data_directory_on_diverged_timelines: true
bootstrap:
  dcs: {"postgresql":{"parameters":{"wal_level":"replica","hot_standby":"on","wal_keep_segments":10,"max_wal_senders":4,"max_replication_slots":4,"checkpoint_timeout":30,"max_prepared_transactions":0,"track_commit_timestamp":"off","max_connections":500,"max_locks_per_transaction":128,"max_worker_processes":8,"wal_log_hints":"off"},"use_pg_rewind":true,"use_slots":true},"slots":{},"loop_wait":10,"ttl":30,"retry_timeout":10,"maximum_lag_on_failover":1048576,"max_timelines_history":0,"master_start_timeout":300}
  method: gitlab_ctl
  gitlab_ctl:
    command: /opt/gitlab/bin/gitlab-ctl patroni bootstrap --srcdir=/var/opt/gitlab/patroni/data
restapi:
  listen: :8008
  connect_address: 10.142.0.43:8008
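One detail in the `dcs` settings above may be worth checking: `use_pg_rewind` is `true` while `wal_log_hints` is `"off"`. PostgreSQL's `pg_rewind` requires either `wal_log_hints` to be on or data checksums to have been enabled when the cluster was initialized. Whether checksums are enabled on these environments is not stated in this report, so the following is only a hedged sketch of a sanity check over the `dcs` JSON, not a diagnosis; the function name is hypothetical.

```python
import json
from typing import List

# Spellings PostgreSQL accepts for a boolean "on" value.
TRUTHY = {"on", "true", "yes", "1"}


def rewind_prerequisite_warnings(dcs_json: str) -> List[str]:
    """Return warnings about pg_rewind prerequisites found in a dcs JSON string."""
    dcs = json.loads(dcs_json)
    postgresql = dcs.get("postgresql", {})
    parameters = postgresql.get("parameters", {})

    warnings: List[str] = []
    wal_log_hints = str(parameters.get("wal_log_hints", "off")).lower()
    if postgresql.get("use_pg_rewind") and wal_log_hints not in TRUTHY:
        warnings.append(
            "use_pg_rewind is enabled but wal_log_hints is off; "
            "pg_rewind then depends on data checksums being enabled"
        )
    return warnings
```

Run against the `dcs` value above, this would return one warning. If data checksums turn out to be enabled on the cluster, the warning can be ignored.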