Patroni failover
Summary
Patroni performed a failover that was not 100% clean:
- The old node remained in the load balancer, so writes continued to be routed to it
- The old node fails to start postgres, so it is unable to become master in the future
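A minimal sketch of the role-aware health check that would let the load balancer drop a demoted node automatically, assuming Patroni's default REST API port (8008); hostnames are illustrative, not production config. Patroni's `/master` endpoint returns HTTP 200 only on the current leader and 503 on replicas:

```shell
#!/bin/sh
# Sketch of a role-aware load balancer health check, assuming Patroni's
# default REST API port; hostnames are illustrative. /master returns
# HTTP 200 only on the current leader, 503 on replicas.
role_status() {
  # $1 = node hostname; prints the HTTP status of the /master check
  curl -s -o /dev/null -w '%{http_code}' "http://$1:8008/master"
}

# Pure decision helper: only a 200 keeps a node in the write pool;
# anything else (503 from a replica, an unreachable node) drops it.
lb_keep_in_write_pool() {
  [ "$1" = "200" ] && echo keep || echo drop
}
```

With a check like this in place, node-01 would have been ejected from the write pool as soon as it demoted itself, instead of waiting for the on-call engineer to remove it at 13:28.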
Timeline of events
- 2018-12-30 13:07 UTC: patroni initiated a failover because node-01 was not healthy; node-04 was elected as the new leader
- 2018-12-30 13:08 UTC: alert fired that postgresql is down
- 2018-12-30 13:13 UTC: alert fired that patroni-01 is down
- 2018-12-30 13:14 UTC: on-call engineer began investigating
- 2018-12-30 13:28 UTC: on-call engineer removed node-01 from the patroni Google load balancer
- 2018-12-30 14:04 UTC: node-01 was declared corrupted and the process to rebuild the node began
- 2018-12-30 14:49 UTC: we sent a HUP to all Unicorns while node-01 was down
- 2018-12-30 15:44 UTC: patroni-01 was back online and its replica data had caught up
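The 14:49 step can be sketched as below: sending SIGHUP to a Unicorn master makes it reload and re-fork its workers, and the fresh worker database connections land on the new leader. This is a hedged illustration; the pidfile path is an assumption, not the production location.

```shell
#!/bin/sh
# Sketch of "hup all unicorns": SIGHUP makes a Unicorn master reload and
# re-fork workers, whose new database connections hit the new leader.
# usage: hup_unicorn /var/opt/gitlab/unicorn/unicorn.pid  (path assumed)
hup_unicorn() {
  if [ -f "$1" ]; then
    kill -HUP "$(cat "$1")"
  else
    echo "no unicorn pidfile at $1" >&2
    return 1
  fi
}
```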
Monitoring
Logs
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:08,631 ERROR: touch_member
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: Traceback (most recent call last):
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 358, in touch_member
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: self._client.kv.put(self.member_path, json.dumps(data, separators=(',', ':')), **args)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/consul/base.py", line 621, in put
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: CB.json(), '/v1/kv/%s' % key, params=params, data=value)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 105, in wrapper
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: return callback(self.response(self.http.request(method.upper(), self.uri(path, params), **kwargs)))
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 76, in response
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: raise ConsulInternalError(msg)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]: patroni.dcs.consul.ConsulInternalError: 500 rpc error making call: rpc error making call: invalid session "106a64cb-4e26-f849-7d09-32c3f4b95f85"
2018-12-30_13:07:17 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:17,562 INFO: Could not take out TTL lock
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,475 INFO: demoted self after trying and failing to obtain lock
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,477 WARNING: Loop time exceeded, rescheduling immediately.
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,480 INFO: Lock owner: None; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,480 INFO: not healthy enough for leader race
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,480 INFO: starting after demotion in progress
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,578 INFO: closed patroni connection to the postgresql cluster
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:34,657 INFO: postmaster pid=17607
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]: localhost:5432 - no response
2018-12-30_13:07:36 patroni-01-db-gprd patroni[43008]: localhost:5432 - rejecting connections
2018-12-30_13:07:36 patroni-01-db-gprd patroni[43008]: localhost:5432 - rejecting connections
2018-12-30_13:07:37 patroni-01-db-gprd patroni[43008]: localhost:5432 - rejecting connections
2018-12-30_13:07:38 patroni-01-db-gprd patroni[43008]: localhost:5432 - rejecting connections
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: localhost:5432 - accepting connections
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,681 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,682 INFO: does not have lock
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,682 INFO: establishing a new patroni connection to the postgres cluster
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,698 INFO: Local timeline=7 lsn=A666/C7EB1180
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,729 INFO: master_timeline=8
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,729 INFO: master: history=1#011D2/AE3F3440#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2#0112024/9000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 3#0113A75/90000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 4#0115C10/3000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 5#0117C6A/7000000#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 6#0119EDE/64000100#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 7#011A666/C775F138#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:39,740 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:43,038 INFO: running pg_rewind from user=gitlab-superuser host=10.220.16.104 port=5432 dbname=postgres sslmode=prefer sslcompression=1
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]: servers diverged at WAL position A666/C775F138 on timeline 7
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]: rewinding from last common checkpoint at A666/BAE3F0C8 on timeline 7
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: could not open target file "/var/opt/gitlab/postgresql/data/server.crt": Permission denied
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: Failure, exiting
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,049 ERROR: Failed to rewind from healty master: patroni-04-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,054 WARNING: Postgresql is not running.
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,054 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: pg_controldata: could not read file "/var/opt/gitlab/postgresql/data/global/pg_control": read 0 of 264
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,056 ERROR: Error when calling pg_controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1229, in controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: data = subprocess.check_output([self._pgcommand('pg_controldata'), self._data_dir], env=env)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: **kwargs).stdout
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 708, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: output=stdout, stderr=stderr)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: subprocess.CalledProcessError: Command '['/usr/lib/postgresql/9.6/bin/pg_controldata', '/var/opt/gitlab/postgresql/data']' returned non-zero exit status 1
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,150 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,161 INFO: starting as a secondary
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: pg_controldata: could not read file "/var/opt/gitlab/postgresql/data/global/pg_control": read 0 of 264
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,164 ERROR: Error when calling pg_controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1229, in controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: data = subprocess.check_output([self._pgcommand('pg_controldata'), self._data_dir], env=env)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: **kwargs).stdout
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/usr/lib/python3.5/subprocess.py", line 708, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: output=stdout, stderr=stderr)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: subprocess.CalledProcessError: Command '['/usr/lib/postgresql/9.6/bin/pg_controldata', '/var/opt/gitlab/postgresql/data']' returned non-zero exit status 1
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,165 ERROR: Exception during execution of long running task restarting after failure
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/async_executor.py", line 97, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: wakeup = func(*args) if args else func()
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1459, in follow
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: self.start(timeout=timeout)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 901, in start
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: configuration = self._server_parameters if self.role == 'master' else self._build_effective_configuration()
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 870, in _build_effective_configuration
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: cvalue = parse_int(data[cname])
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: KeyError: 'max_connections setting'
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: localhost:5432 - no response
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,194 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:44,204 INFO: failed to start postgres
2018-12-30_13:07:54 patroni-01-db-gprd patroni[43008]: 2018-12-30 13:07:54,462 WARNING: Postgresql is not running.
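The log above shows the chain that corrupted node-01: pg_rewind failed to open server.crt ("Permission denied") and exited partway through, leaving pg_control zero-length ("read 0 of 264"), after which pg_controldata and every subsequent postgres start failed. A hedged pre-flight sketch that lists files pg_rewind would be unable to rewrite; the data directory path comes from the log, while the owner name is an assumption.

```shell
#!/bin/sh
# Pre-flight check before pg_rewind: any file in the data directory not
# owned by the postgres user (like the server.crt above) can abort the
# rewind midway and leave pg_control truncated.
# PGDATA is taken from the log; the owner name is an assumption.
PGDATA=${PGDATA:-/var/opt/gitlab/postgresql/data}
PGOWNER=${PGOWNER:-gitlab-psql}

unwritable_files() {
  # print files under data dir $1 that are not owned by user $2
  find "$1" ! -user "$2" 2>/dev/null
}

# example: refuse to rewind if anything would be unwritable
# [ -z "$(unwritable_files "$PGDATA" "$PGOWNER")" ] || echo "fix ownership first" >&2
```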
What went well
- Patroni performed a failover, which allowed our app to remain up
What did not go well
- When the leader changes, a few Chef roles hard-code which node is the master, so we need to run chef-client on many, many nodes to force updates https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5675
- node-01 remained healthy in the load balancer, causing many writes to be rejected: all secondary nodes were reporting 'ERROR: cannot execute INSERT in a read-only transaction' https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5843
- the wal-e backups alerted, became hung, and are delayed https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5839
- the on-call engineer quickly became overwhelmed due to many other alerts firing while still investigating the root cause of all the alerts https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5839 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841
- node-01 could not be restarted: the postgres service would not stay up, and pg_controldata returned no data https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5842
- we are missing runbook data on diagnosing the above and on performing a node rebuild https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5840
- we need to make recovering a failed node easier: there are many handcrafted steps that can introduce errors, and getting a node back to a usable state takes a long time https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5840
- while repointing nodes at the new leader, pgbouncer on patroni-04 became overloaded and many connections became blocked https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5838
- node-04 was above our alert thresholds for table vacuums immediately when it was elected leader https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841
- consul logs on patroni-01 have been lost https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5844
- SSL Certificates in the data directory may have prevented patroni-01 from being able to participate in the cluster immediately after failover https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5849
- upgrade patroni https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5856
- utilize a different consul configuration for patroni https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5857
- determine whether our current count of postgres replicas is adequate during cluster incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5893
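For the read-only INSERT errors above, a node's actual role can be confirmed directly instead of trusting the load balancer's view: `SELECT pg_is_in_recovery()` returns `t` on a replica and `f` on the leader (e.g. via `psql -h HOST -tAc 'SELECT pg_is_in_recovery()'`; connection details are illustrative). A small helper mapping that answer to a role name:

```shell
#!/bin/sh
# Map the output of `psql -tAc 'SELECT pg_is_in_recovery()'` to a role:
# "f" means the node accepts writes (leader); "t" means read-only replica.
# Connection flags and hostnames are illustrative, not production config.
node_role() {
  case "$1" in
    f) echo leader ;;
    t) echo replica ;;
    *) echo unknown ;;
  esac
}

# example wiring (not executed here):
# node_role "$(psql -h patroni-04-db-gprd -tAc 'SELECT pg_is_in_recovery()')"
```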
Corrective Actions
Edited by John Skarbek