patroni failover

Summary

Patroni performed a failover that was not 100% clean:

  • The old node remained in the load balancer (see the sketch below)
  • The old node is failing to start postgres, so it cannot become master in the future
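
A quick way to confirm what the load balancer sees is to probe Patroni's REST API on each node, which is what the health checks are normally pointed at. A minimal sketch, assuming the default REST API port (8008) and illustrative host names:

    # Probe Patroni's REST API role endpoints on each node. Patroni answers 200
    # when the node currently holds that role and 503 otherwise, which is what
    # load-balancer health checks key off. Host names and port are assumptions.
    import requests

    NODES = ["patroni-01-db-gprd", "patroni-04-db-gprd"]

    for node in NODES:
        for endpoint in ("master", "replica"):
            try:
                r = requests.get("http://%s:8008/%s" % (node, endpoint), timeout=2)
                print("%s /%s -> %s" % (node, endpoint, r.status_code))
            except requests.RequestException as exc:
                print("%s /%s -> unreachable (%s)" % (node, endpoint, exc))

If a demoted node keeps answering 200 on /master, the load balancer will keep routing writes to it, which would match the behaviour seen here.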

Timeline of events

  • 2018-12-30 13:07 UTC: patroni initiated a failover, node-01 not healthy, node-04 was elected as new leader
  • 2018-12-30 13:08 UTC: alert that postgresql is down
  • 2018-12-30 13:13 UTC: alert that patroni-01 is down
  • 2018-12-30 13:14 UTC: on-call engineer is investigating
  • 2018-12-30 13:28 UTC: on-call engineer removes node-01 from the patroni google load balancer
  • 2018-12-30 14:04 UTC: node-01 was declared corrupted; the process to rebuild this node was begun
  • 2018-12-30 14:49 UTC: we HUP'd all unicorns while node-01 was down (see the sketch after this timeline)
  • 2018-12-30 15:44 UTC: patroni-01 was back online and replica data was caught up
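
For reference, the 14:49 step boils down to sending SIGHUP to each unicorn master so its workers are respawned and pick up fresh database connections. A minimal per-host sketch, assuming a hypothetical pidfile path (in practice this was done fleet-wide):

    # Send SIGHUP to a single unicorn master; the master reloads and respawns
    # workers, which re-establish their database connections to the new primary.
    # The pidfile path below is an assumption and may differ per deployment.
    import os
    import signal

    PIDFILE = "/var/opt/gitlab/unicorn/unicorn.pid"

    with open(PIDFILE) as f:
        pid = int(f.read().strip())

    os.kill(pid, signal.SIGHUP)
    print("sent SIGHUP to unicorn master pid %d" % pid)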

Monitoring

https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&var-environment=gprd&var-prometheus=prometheus-01-inf-gprd&var-type=patroni&from=1546166517089&to=1546177317089

Logs

2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:08,631 ERROR: touch_member
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:  Traceback (most recent call last):
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 358, in touch_member
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:      self._client.kv.put(self.member_path, json.dumps(data, separators=(',', ':')), **args)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/consul/base.py", line 621, in put
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:      CB.json(), '/v1/kv/%s' % key, params=params, data=value)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 105, in wrapper
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:      return callback(self.response(self.http.request(method.upper(), self.uri(path, params), **kwargs)))
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/dcs/consul.py", line 76, in response
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:      raise ConsulInternalError(msg)
2018-12-30_13:07:08 patroni-01-db-gprd patroni[43008]:  patroni.dcs.consul.ConsulInternalError: 500 rpc error making call: rpc error making call: invalid session "106a64cb-4e26-f849-7d09-32c3f4b95f85"
2018-12-30_13:07:17 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:17,562 INFO: Could not take out TTL lock
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,475 INFO: demoted self after trying and failing to obtain lock
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,477 WARNING: Loop time exceeded, rescheduling immediately.
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,480 INFO: Lock owner: None; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,480 INFO: not healthy enough for leader race
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,480 INFO: starting after demotion in progress
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,578 INFO: closed patroni connection to the postgresql cluster
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:34,657 INFO: postmaster pid=17607
2018-12-30_13:07:34 patroni-01-db-gprd patroni[43008]:  localhost:5432 - no response
2018-12-30_13:07:36 patroni-01-db-gprd patroni[43008]:  localhost:5432 - rejecting connections
2018-12-30_13:07:36 patroni-01-db-gprd patroni[43008]:  localhost:5432 - rejecting connections
2018-12-30_13:07:37 patroni-01-db-gprd patroni[43008]:  localhost:5432 - rejecting connections
2018-12-30_13:07:38 patroni-01-db-gprd patroni[43008]:  localhost:5432 - rejecting connections
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  localhost:5432 - accepting connections
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,681 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,682 INFO: does not have lock
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,682 INFO: establishing a new patroni connection to the postgres cluster
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,698 INFO: Local timeline=7 lsn=A666/C7EB1180
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,729 INFO: master_timeline=8
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,729 INFO: master: history=1#011D2/AE3F3440#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2#0112024/9000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  3#0113A75/90000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  4#0115C10/3000098#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  5#0117C6A/7000000#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  6#0119EDE/64000100#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  7#011A666/C775F138#011no recovery target specified
2018-12-30_13:07:39 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:39,740 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:43,038 INFO: running pg_rewind from user=gitlab-superuser host=10.220.16.104 port=5432 dbname=postgres sslmode=prefer sslcompression=1
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]:  servers diverged at WAL position A666/C775F138 on timeline 7
2018-12-30_13:07:43 patroni-01-db-gprd patroni[43008]:  rewinding from last common checkpoint at A666/BAE3F0C8 on timeline 7
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  could not open target file "/var/opt/gitlab/postgresql/data/server.crt": Permission denied
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  Failure, exiting
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,049 ERROR: Failed to rewind from healty master: patroni-04-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,054 WARNING: Postgresql is not running.
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,054 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  pg_controldata: could not read file "/var/opt/gitlab/postgresql/data/global/pg_control": read 0 of 264
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,056 ERROR: Error when calling pg_controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1229, in controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      data = subprocess.check_output([self._pgcommand('pg_controldata'), self._data_dir], env=env)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      **kwargs).stdout
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/usr/lib/python3.5/subprocess.py", line 708, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      output=stdout, stderr=stderr)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  subprocess.CalledProcessError: Command '['/usr/lib/postgresql/9.6/bin/pg_controldata', '/var/opt/gitlab/postgresql/data']' returned non-zero exit status 1
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,150 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,161 INFO: starting as a secondary
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  pg_controldata: could not read file "/var/opt/gitlab/postgresql/data/global/pg_control": read 0 of 264
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,164 ERROR: Error when calling pg_controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1229, in controldata
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      data = subprocess.check_output([self._pgcommand('pg_controldata'), self._data_dir], env=env)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/usr/lib/python3.5/subprocess.py", line 626, in check_output
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      **kwargs).stdout
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/usr/lib/python3.5/subprocess.py", line 708, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      output=stdout, stderr=stderr)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  subprocess.CalledProcessError: Command '['/usr/lib/postgresql/9.6/bin/pg_controldata', '/var/opt/gitlab/postgresql/data']' returned non-zero exit status 1
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,165 ERROR: Exception during execution of long running task restarting after failure
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  Traceback (most recent call last):
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/async_executor.py", line 97, in run
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      wakeup = func(*args) if args else func()
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 1459, in follow
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      self.start(timeout=timeout)
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 901, in start
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      configuration = self._server_parameters if self.role == 'master' else self._build_effective_configuration()
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:    File "/opt/patroni/lib/python3.5/site-packages/patroni/postgresql.py", line 870, in _build_effective_configuration
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:      cvalue = parse_int(data[cname])
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  KeyError: 'max_connections setting'
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  localhost:5432 - no response
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,194 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2018-12-30_13:07:44 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:44,204 INFO: failed to start postgres
2018-12-30_13:07:54 patroni-01-db-gprd patroni[43008]:  2018-12-30 13:07:54,462 WARNING: Postgresql is not running.
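
The two failure signatures in the log above are pg_rewind hitting "Permission denied" on server.crt and pg_controldata reading 0 bytes of pg_control. A minimal diagnostic sketch for both, assuming the data directory path from the logs and that the postgres OS user is gitlab-psql (an assumption; adjust as needed):

    # Flag files in the data directory that the postgres user does not own
    # (the usual reason pg_rewind fails with "Permission denied"), and check
    # whether pg_control looks truncated, as pg_controldata reported above.
    import os
    import pwd

    DATA_DIR = "/var/opt/gitlab/postgresql/data"
    PG_USER = "gitlab-psql"  # assumption: use the actual postgres OS user

    uid = pwd.getpwnam(PG_USER).pw_uid

    for root, _dirs, files in os.walk(DATA_DIR):
        for name in files:
            path = os.path.join(root, name)
            if os.stat(path).st_uid != uid:
                print("not owned by %s: %s" % (PG_USER, path))

    pg_control = os.path.join(DATA_DIR, "global", "pg_control")
    size = os.path.getsize(pg_control)
    print("pg_control is %d bytes%s" % (size, " (looks truncated)" if size == 0 else ""))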

What went well

  • Patroni performed a failover, which allowed our app to remain up

What did not go well

  • When the leader is rotated, we have a few roles where we hard-code who the master is, so we need to run Chef on many, many nodes to force some updates... https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5675
  • node-01 remained marked healthy in the load balancer, causing many writes to be rejected
  • All secondary nodes were reporting 'ERROR: cannot execute INSERT in a read-only transaction' https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5843
  • the wal-e backups alerted, became hung, and are delayed https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5839
  • the on-call engineer quickly became overwhelmed due to many other alerts firing while still investigating the root cause of the alerts https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5839 https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841
  • node-01 could not be restarted - the postgres service would not stay up and patroni's pg_controldata check returned no data https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5842
  • we are missing runbook documentation on diagnosing the above and on performing a node rebuild https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5840
  • we need to make the process of recovering a failed node easier - there are many hand-crafted steps that can introduce errors, and it takes a lengthy amount of time to get a node back to a usable state https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5840
  • While repointing nodes to the new leader, pgbouncer on patroni-04 became overloaded and many connections became blocked https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5838
  • node-04 was above our alert thresholds for table vacuums immediately when it was elected leader https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841
  • consul logs on patroni-01 have been lost https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5844
  • SSL certificates in the data directory may have prevented patroni-01 from rejoining the cluster immediately after failover (see the sketch after this list) https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5849
  • upgrade patroni https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5856
  • utilize a different consul configuration for patroni https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5857
  • determine whether our current count of postgres replicas is sufficient during cluster incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5893
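
Related to the SSL-certificates point above (issue 5849): pg_rewind copies files inside the data directory, so certificates that live there but are not owned by the postgres user can abort the rewind, which matches the "Permission denied" on server.crt in the logs. A minimal sketch, assuming postgresql.conf lives inside the data directory, that reports whether the SSL settings point into PGDATA:

    # Report whether ssl_cert_file / ssl_key_file / ssl_ca_file resolve to paths
    # inside the data directory. DATA_DIR matches the logs; the config file
    # location inside PGDATA is an assumption.
    import os
    import re

    DATA_DIR = "/var/opt/gitlab/postgresql/data"
    CONF = os.path.join(DATA_DIR, "postgresql.conf")

    pattern = re.compile(r"^\s*(ssl_cert_file|ssl_key_file|ssl_ca_file)\s*=\s*'([^']+)'")

    with open(CONF) as f:
        for line in f:
            m = pattern.match(line)
            if not m:
                continue
            setting, value = m.groups()
            # Relative values resolve against the data directory.
            path = value if os.path.isabs(value) else os.path.join(DATA_DIR, value)
            inside = os.path.realpath(path).startswith(os.path.realpath(DATA_DIR) + os.sep)
            print("%s -> %s (%s the data directory)" % (setting, path, "inside" if inside else "outside"))

Moving the certificates outside the data directory and pointing these settings at the new location is one way to keep pg_rewind from touching them.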

Corrective Actions

Candidate for OnGres assistance, by issue:
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5675
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5843
❌ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5839
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5842
❌ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5840
❌ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5838
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5841
❌ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5844
❌ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5849
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5849
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5856
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5857
✔ https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5893