Investigate what happened to patroni-01 after demotion
During a recent failover (production#637), patroni-01 was unable to talk to the cluster after it had demoted itself.
As noted in the logs:
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,054 WARNING: Postgresql is not running.
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,054 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: pg_controldata: could not read file "/var/opt/gitlab/postgresql/data/global/pg_control": read 0 of 264
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,056 ERROR: Error when calling pg_controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1229, in controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: data = subprocess.check_output([self._pgcommand('pg_controldata'), self._data_dir], env=env)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: **kwargs).stdout
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 708, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: output=stdout, stderr=stderr)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: subprocess.CalledProcessError: Command '['/usr/lib/postgresql/9.6/bin/pg_controldata', '/var/opt/gitlab/postgresql/data']' returned non-zero exit status 1
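pg_controldata returned no information about the cluster: it exited with status 1 after reading 0 of the expected 264 bytes from global/pg_control, which suggests the control file was empty at that point. This investigation should answer the following questions: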
- Why did this happen?
- How can we prevent this in the future?
- Is there a better way to recover than completely rebuilding this node?
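
To help with the investigation, here is a minimal sketch of a pre-check that separates a missing or zero-length pg_control from other pg_controldata failures before deciding whether a full rebuild is needed. It is not Patroni's code. The function name `inspect_pg_control` and its status strings are hypothetical, and the data directory and binary paths in the usage note are simply the ones shown in the log above.

```python
import os
import subprocess


def inspect_pg_control(data_dir, pg_controldata="pg_controldata"):
    """Classify the state of <data_dir>/global/pg_control.

    Returns a (status, detail) tuple where status is one of
    "missing", "empty", "unreadable" or "ok". These labels are
    illustrative only and are not part of Patroni or PostgreSQL.
    """
    control_file = os.path.join(data_dir, "global", "pg_control")

    # A missing file cannot be sized, so report that case separately.
    try:
        size = os.path.getsize(control_file)
    except OSError as exc:
        return "missing", str(exc)

    # A zero-length file matches the "read 0 of 264" symptom in the log.
    if size == 0:
        return "empty", control_file

    # Otherwise let pg_controldata validate the contents, capturing its
    # error output instead of letting the exception propagate.
    try:
        out = subprocess.check_output(
            [pg_controldata, data_dir], stderr=subprocess.STDOUT
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        return "unreadable", str(exc)

    return "ok", out.decode(errors="replace")
```

On the affected node this could be called as `inspect_pg_control("/var/opt/gitlab/postgresql/data", "/usr/lib/postgresql/9.6/bin/pg_controldata")`. An "empty" result would point at the control file itself rather than at the pg_controldata invocation, which narrows the first question above.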