Patroni failovers can still fail with error 'could not locate a valid checkpoint record'

Summary

After switching our performance test environments to Patroni and PG12, we've been seeing a sporadic Patroni issue on them.

Much like a previously reported issue, #5746 (closed), the problem is around failover. Specifically, a node that was previously a primary is unable to rejoin the cluster as a secondary due to what appears to be a corrupted transaction log, with the error PANIC: could not locate a valid checkpoint record.

In the previous issue, #5746 (closed), we saw similar failures on our environments. At that time it was discovered that pg_rewind wasn't enabled by default, and enabling it was the fix (along with additional optional flags that allow Patroni to perform corrective actions on certain failures). This issue still occurs with all of those options enabled.

One potential reason this is happening is that we switch our environments off daily after running our tests. We do this with a gcloud command, gcloud compute instances stop, which according to the docs performs a clean shutdown, much like invoking the shutdown functionality of a workstation or laptop. My expectation was that this should be fine (and it has been for every other component in GitLab so far), as a clean shutdown should in turn stop the GitLab services. This is only an assumption on my part, though. If it isn't working like this, could the machine shutdowns be corrupting the transaction log?
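
One way to test the clean-shutdown assumption would be to inspect the control file on each node after it has been stopped. The following is a minimal sketch, assuming `pg_controldata` is available under the `bin_dir` shown in the configuration below and is run against the `data_dir`. The function names and the sample text are hypothetical and were not captured from the affected node. Note that a clean state in the control file does not prove the WAL itself is intact: the log in this report shows "shut down in recovery" immediately followed by the PANIC.

```python
import re

# States that pg_controldata reports after an orderly stop.
CLEAN_STATES = {"shut down", "shut down in recovery"}


def cluster_state(controldata_output: str) -> str:
    """Extract the 'Database cluster state' value from pg_controldata output."""
    match = re.search(
        r"^Database cluster state:[ \t]*(.+?)[ \t]*$",
        controldata_output,
        re.MULTILINE,
    )
    if match is None:
        raise ValueError("no 'Database cluster state' line found")
    return match.group(1)


def was_clean_shutdown(controldata_output: str) -> bool:
    """True if the control file records an orderly shutdown."""
    return cluster_state(controldata_output) in CLEAN_STATES


# Illustrative input only (hypothetical, not taken from the failed node).
SAMPLE_CLEAN = "Database cluster state:               shut down in recovery\n"
SAMPLE_UNCLEAN = "Database cluster state:               in production\n"

if __name__ == "__main__":
    print(was_clean_shutdown(SAMPLE_CLEAN))
    print(was_clean_shutdown(SAMPLE_UNCLEAN))
```

Running this check on every node between the nightly `gcloud compute instances stop` and the next start would show whether Postgres was actually stopped before the machine went down.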

Another potential reason is that we upgrade the boxes each day with the latest omnibus nightly. It's possible that the reconfigure is also corrupting the log.

Steps to reproduce

Currently we're seeing the issue on at least one of our 5 applicable test environments each week. To reproduce, an HA Postgres cluster would be needed, potentially with multiple shutdowns and upgrades to the latest nightly package.

What is the current bug behavior?

Former primary Postgres nodes can fail after switching to a secondary with the error PANIC: could not locate a valid checkpoint record.

What is the expected correct behavior?

Postgres can fail over to a new primary without issue, or at least recover through the optional corrective actions.

Relevant logs

When the node fails it will enter a reboot loop with the following error being posted constantly:

2021-02-01_11:26:23.41899 2021-02-01 11:26:23,417 INFO: Lock owner: gitlab-qa-50k-postgres-3.c.gitlab-qa-50k-193234.internal; I am gitlab-qa-50k-postgres-2.c.gitlab-qa-50k-193234.internal
2021-02-01_11:26:23.42095 2021-02-01 11:26:23,420 INFO: starting as a secondary
2021-02-01_11:26:23.61027 2021-02-01 11:26:23,609 INFO: postmaster pid=3496
2021-02-01_11:26:23.61148 LOG:  starting PostgreSQL 12.4 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.5.0-3ubuntu1~18.04) 7.5.0, 64-bit
2021-02-01_11:26:23.61151 LOG:  listening on IPv4 address "0.0.0.0", port 5432
2021-02-01_11:26:23.61369 /var/opt/gitlab/postgresql:5432 - no response
2021-02-01_11:26:23.61617 LOG:  listening on Unix socket "/var/opt/gitlab/postgresql/.s.PGSQL.5432"
2021-02-01_11:26:24.13034 LOG:  database system was shut down in recovery at 2021-02-01 01:36:23 GMT
2021-02-01_11:26:24.13050 LOG:  entering standby mode
2021-02-01_11:26:24.13062 LOG:  invalid resource manager ID in primary checkpoint record
2021-02-01_11:26:24.13063 PANIC:  could not locate a valid checkpoint record
2021-02-01_11:26:24.26456 LOG:  startup process (PID 3498) was terminated by signal 6: Aborted
2021-02-01_11:26:24.26459 LOG:  aborting startup due to startup process failure
2021-02-01_11:26:24.33607 LOG:  database system is shut down
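
To track how often this happens across environments, the Patroni log could be scanned for the PANIC together with the checkpoint message that precedes it. This is a minimal sketch, assuming the log format shown above. The function and variable names are hypothetical, and the sample lines are copied from the log in this report.

```python
import re

PANIC_RE = re.compile(r"PANIC:\s+could not locate a valid checkpoint record")
CAUSE_RE = re.compile(r"LOG:\s+(invalid .* checkpoint record)")


def find_checkpoint_panics(lines):
    """Return (panic_line, cause) pairs.

    cause is the most recent 'invalid ... checkpoint record' LOG message
    seen before the PANIC, or None if there was none.
    """
    results = []
    last_cause = None
    for line in lines:
        cause = CAUSE_RE.search(line)
        if cause:
            last_cause = cause.group(1)
        elif PANIC_RE.search(line):
            results.append((line.strip(), last_cause))
            last_cause = None  # reset so the next restart attempt is tracked separately
    return results


# Lines copied from the log in this report.
SAMPLE_LOG = [
    "2021-02-01_11:26:24.13050 LOG:  entering standby mode",
    "2021-02-01_11:26:24.13062 LOG:  invalid resource manager ID in primary checkpoint record",
    "2021-02-01_11:26:24.13063 PANIC:  could not locate a valid checkpoint record",
    "2021-02-01_11:26:24.26456 LOG:  startup process (PID 3498) was terminated by signal 6: Aborted",
]

if __name__ == "__main__":
    for panic_line, cause in find_checkpoint_panics(SAMPLE_LOG):
        print(panic_line)
        print("  preceded by:", cause)
```

Because the node restarts in a loop, the same PANIC appears many times. Counting distinct timestamps per node would give a rough picture of when the failure first appeared relative to the nightly shutdown and upgrade.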

Configuration details

`patroni.yaml` file of node that failed


name: gitlab-qa-50k-postgres-2.c.gitlab-qa-50k-193234.internal
scope: postgresql-ha
log:
  level: INFO
consul:
  url: http://127.0.0.1:8500
  service_check_interval: 10s
  register_service: true
  checks: []
postgresql:
  bin_dir: /opt/gitlab/embedded/bin
  data_dir: /var/opt/gitlab/postgresql/data
  config_dir: /var/opt/gitlab/postgresql/data
  listen: 0.0.0.0:5432
  connect_address: 10.142.0.43:5432
  use_unix_socket: true
  parameters:
    unix_socket_directories: /var/opt/gitlab/postgresql
  authentication:
    superuser:
      username: gitlab-psql
    replication:
      username: gitlab_replicator
  remove_data_directory_on_rewind_failure: true
  remove_data_directory_on_diverged_timelines: true
bootstrap:
  dcs: {"postgresql":{"parameters":{"wal_level":"replica","hot_standby":"on","wal_keep_segments":10,"max_wal_senders":4,"max_replication_slots":4,"checkpoint_timeout":30,"max_prepared_transactions":0,"track_commit_timestamp":"off","max_connections":500,"max_locks_per_transaction":128,"max_worker_processes":8,"wal_log_hints":"off"},"use_pg_rewind":true,"use_slots":true},"slots":{},"loop_wait":10,"ttl":30,"retry_timeout":10,"maximum_lag_on_failover":1048576,"max_timelines_history":0,"master_start_timeout":300}
  method: gitlab_ctl
  gitlab_ctl:
    command: /opt/gitlab/bin/gitlab-ctl patroni bootstrap --srcdir=/var/opt/gitlab/patroni/data
restapi:
  listen: :8008
  connect_address: 10.142.0.43:8008
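
One detail in the `dcs` settings above may be worth checking: `use_pg_rewind` is `true` while `wal_log_hints` is `"off"`. According to the PostgreSQL documentation, pg_rewind requires the target cluster to have either `wal_log_hints` enabled or data checksums enabled. Whether data checksums are enabled on these nodes is not known from this report, so the sketch below treats that as an input. The function name is hypothetical, and the JSON is an excerpt of the `dcs` line above.

```python
import json


def rewind_prerequisite_warnings(dcs, data_checksums_enabled=False):
    """Return warnings if pg_rewind is requested but its prerequisites look unmet.

    data_checksums_enabled is an assumption supplied by the caller; it cannot
    be derived from the Patroni configuration alone.
    """
    pg = dcs.get("postgresql", {})
    params = pg.get("parameters", {})
    warnings = []
    if pg.get("use_pg_rewind"):
        hints = str(params.get("wal_log_hints", "off")).lower()
        hints_on = hints in ("on", "true", "yes", "1")
        if not hints_on and not data_checksums_enabled:
            warnings.append(
                "use_pg_rewind is enabled, but wal_log_hints is off and "
                "data checksums are not known to be enabled"
            )
    return warnings


# Excerpt of the bootstrap dcs JSON from the configuration in this report.
DCS_EXCERPT = (
    '{"postgresql":{"parameters":{"wal_level":"replica","wal_log_hints":"off"},'
    '"use_pg_rewind":true,"use_slots":true}}'
)

if __name__ == "__main__":
    for warning in rewind_prerequisite_warnings(json.loads(DCS_EXCERPT)):
        print("WARNING:", warning)
```

If the nodes do have data checksums enabled, this warning does not apply and the cause lies elsewhere.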

Edited Feb 01, 2021 by Grant Young