Investigate RCA from production/issues/1476
RCA Deep Dive
What happened
A network glitch caused a temporary quorum loss and triggered a failover routine. Since nothing was actually wrong with the current leader (patroni-06), there was no leader change: Patroni behaved as expected, starting a failover but never completing it, because patroni-06 was still detected as healthy.
Evidence
A Prometheus graph (not reproduced here) shows the network interruption.
At Dec 12 07:31:24, Consul reported losing contact with the current leader, patroni-06, followed by a quick re-join:
Dec 12 07:31:24 pgbouncer-02-db-gprd consul[26816]: 2019/12/12 07:31:24 [INFO] serf: EventMemberFailed: patroni-06-db-gprd 10.220.16.106
Dec 12 07:31:24 pgbouncer-02-db-gprd consul[26816]: serf: EventMemberFailed: patroni-06-db-gprd 10.220.16.106
Dec 12 07:31:25 pgbouncer-02-db-gprd consul[26816]: 2019/12/12 07:31:25 [INFO] serf: EventMemberJoin: patroni-06-db-gprd 10.220.16.106
Dec 12 07:31:25 pgbouncer-02-db-gprd consul[26816]: serf: EventMemberJoin: patroni-06-db-gprd 10.220.16.106
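The gap between EventMemberFailed and EventMemberJoin puts the flap at roughly one second. A quick check of that window, using the timestamps from the log lines above:

```python
from datetime import datetime

# Timestamps taken from the consul log entries above.
failed = datetime.strptime("2019/12/12 07:31:24", "%Y/%m/%d %H:%M:%S")
rejoined = datetime.strptime("2019/12/12 07:31:25", "%Y/%m/%d %H:%M:%S")

# The serf membership flap lasted on the order of one second.
print((rejoined - failed).total_seconds())  # 1.0
```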
At 2019-12-12 07:31:26.317 GMT, the same network failure caused an error during a WAL file upload:
2019-12-12 07:31:26.317 GMT [45589, 0]: [3-1] user=gitlab-replicator,db=[unknown],app=patroni-10-db-gprd.c.gitlab-production.internal,client=10.220.16.110 LOG: disconnection: session time: 461:46:01.162 user=gitlab-replicator database= host=10.220.16.110 port=49464
wal_e.worker.upload INFO MSG: begin archiving a file
DETAIL: Uploading "pg_xlog/0000001F0001862000000009" to "gs://gitlab-gprd-postgres-backup/pitr-wale-v1/wal_005/0000001F0001862000000009.lzo".
STRUCTURED: time=2019-12-12T07:31:26.981242-00 pid=62869 action=push-wal key=gs://gitlab-gprd-postgres-backup/pitr-wale-v1/wal_005/0000001F0001862000000009.lzo prefix=pitr-wale-v1/ seg=0000001F0001862000000009 state=begin
Traceback (most recent call last):
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1158, in upload_from_file
client, file_obj, content_type, size, num_retries, predefined_acl
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1068, in _do_upload
client, stream, content_type, size, num_retries, predefined_acl
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1015, in _do_resumable_upload
response = upload.transmit_next_chunk(transport)
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/requests/upload.py", line 396, in transmit_next_chunk
self._process_response(result, len(payload))
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_upload.py", line 574, in _process_response
self._get_status_code, callback=self._make_invalid)
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_helpers.py", line 93, in require_status_code
status_code, u'Expected one of', *status_codes)
google.resumable_media.common.InvalidResponse: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 59, in __call__
self.gpg_key_id)
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 40, in do_lzop_put
k = blobstore.uri_put_file(creds, url, tf)
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line 38, in uri_put_file
blob.upload_from_file(fp, size=size, content_type=content_type)
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1162, in upload_from_file
_raise_from_invalid_response(exc)
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 2038, in _raise_from_invalid_response
raise exceptions.from_http_status(response.status_code, message, response=response)
google.api_core.exceptions.GoogleAPICallError: 410 PUT https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UomUBipRkns50Aj6FpTMxL-d_WXjfMyLkcAj-9bSnwLLeBl47ZHEC2olCg8340SDRyTC2qbYq_bntvY_9gipuG_v-9PBA: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)
2019-12-12T07:31:29Z <Greenlet at 0x7ffb667bf748: <wal_e.worker.upload.WalUploader object at 0x7ffb6668b6d8>(<wal_e.worker.pg.wal_transfer.WalSegment object at)> failed with GoogleAPICallError
wal_e.main CRITICAL MSG: An unprocessed exception has avoided all error handling
DETAIL: Traceback (most recent call last):
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1158, in upload_from_file
client, file_obj, content_type, size, num_retries, predefined_acl
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1068, in _do_upload
client, stream, content_type, size, num_retries, predefined_acl
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1015, in _do_resumable_upload
response = upload.transmit_next_chunk(transport)
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/requests/upload.py", line 396, in transmit_next_chunk
self._process_response(result, len(payload))
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_upload.py", line 574, in _process_response
self._get_status_code, callback=self._make_invalid)
File "/opt/wal-e/lib/python3.5/site-packages/google/resumable_media/_helpers.py", line 93, in require_status_code
status_code, u'Expected one of', *status_codes)
google.resumable_media.common.InvalidResponse: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/cmd.py", line 666, in main
concurrency=args.pool_size)
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/operator/backup.py", line 283, in wal_archive
group.join()
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/pg/wal_transfer.py", line 144, in join
raise val
File "src/gevent/greenlet.py", line 766, in gevent._greenlet.Greenlet.run
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/upload.py", line 59, in __call__
self.gpg_key_id)
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/worker/worker_util.py", line 40, in do_lzop_put
k = blobstore.uri_put_file(creds, url, tf)
File "/opt/wal-e/lib/python3.5/site-packages/wal_e/blobstore/gs/utils.py", line 38, in uri_put_file
blob.upload_from_file(fp, size=size, content_type=content_type)
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 1162, in upload_from_file
_raise_from_invalid_response(exc)
File "/opt/wal-e/lib/python3.5/site-packages/google/cloud/storage/blob.py", line 2038, in _raise_from_invalid_response
raise exceptions.from_http_status(response.status_code, message, response=response)
google.api_core.exceptions.GoogleAPICallError: 410 PUT https://www.googleapis.com/upload/storage/v1/b/gitlab-gprd-postgres-backup/o?uploadType=resumable&upload_id=AEnB2UomUBipRkns50Aj6FpTMxL-d_WXjfMyLkcAj-9bSnwLLeBl47ZHEC2olCg8340SDRyTC2qbYq_bntvY_9gipuG_v-9PBA: ('Request failed with status code', 410, 'Expected one of', <HTTPStatus.OK: 200>, 308)
STRUCTURED: time=2019-12-12T07:31:29.645893-00 pid=62869
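The resumable upload failed on a single 410 response, and wal-e had no handler for it, so the archiver crashed outright. Below is a minimal sketch of the kind of retry wrapper that could absorb such a transient failure; the function name and retry parameters are hypothetical, not wal-e code. Since a 410 invalidates the resumable upload session (note the upload_id in the error above), the safe recovery is to restart the upload from scratch:

```python
import time

from google.api_core import exceptions
from google.cloud import storage


def upload_with_retry(bucket_name, blob_name, file_path,
                      attempts=3, backoff_seconds=5):
    """Upload a file to GCS, restarting the upload on transient API errors."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_name)
    for attempt in range(1, attempts + 1):
        try:
            with open(file_path, "rb") as fp:
                # Each attempt starts a fresh upload session, since a
                # 410 means the previous resumable session is gone.
                blob.upload_from_file(fp)
            return
        except exceptions.GoogleAPICallError:
            if attempt == attempts:
                raise  # out of retries; surface the error to the caller
            time.sleep(backoff_seconds * attempt)
```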
Starting the leader election
Next, every replica that saw at least one peer with a WAL position ahead of its own downvoted itself (a simplified sketch of this comparison follows the log excerpt below):
patroni-01-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-01-db-gprd patroni[58612]: 2019-12-12 07:31:25,249 INFO: Wal position of patroni-09-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-02-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-02-db-gprd patroni[92454]: 2019-12-12 07:31:25,232 INFO: Wal position of patroni-08-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-03-db-gprd.c.gitlab-production.internal/patroni.log.1:2019-12-12_07:31:26 patroni-03-db-gprd patroni[97589]: 2019-12-12 07:31:26,121 INFO: Wal position of patroni-05-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-04-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:26 patroni-04-db-gprd patroni[38752]: 2019-12-12 07:31:26,162 INFO: Wal position of patroni-02-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-05-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-05-db-gprd patroni[24338]: 2019-12-12 07:31:25,228 INFO: Wal position of patroni-07-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-07-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-07-db-gprd patroni[2378]: 2019-12-12 07:31:25,276 INFO: Wal position of patroni-03-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-09-db-gprd.c.gitlab-production.internal/patroni.log.1:2019-12-12_07:31:26 patroni-09-db-gprd patroni[59084]: 2019-12-12 07:31:26,153 INFO: Wal position of patroni-03-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-10-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-10-db-gprd patroni[1712]: 2019-12-12 07:31:25,263 INFO: Wal position of patroni-01-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-11-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:25 patroni-11-db-gprd patroni[54181]: 2019-12-12 07:31:25,281 INFO: Wal position of patroni-02-db-gprd.c.gitlab-production.internal is ahead of my wal position
patroni-12-db-gprd.c.gitlab-production.internal/syslog.1:Dec 12 07:31:26 patroni-12-db-gprd patroni[59289]: 2019-12-12 07:31:26,117 INFO: Wal position of patroni-01-db-gprd.c.gitlab-production.internal is ahead of my wal position
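An illustrative simplification of this self-downvoting comparison (made-up names, not Patroni's actual implementation):

```python
def should_withdraw(my_wal_position, peer_wal_positions):
    # A replica takes itself out of the leader race as soon as any
    # peer reports a WAL position ahead of its own -- the "Wal
    # position of X is ahead of my wal position" messages above.
    return any(peer > my_wal_position for peer in peer_wal_positions)


# Example with received_location values from the logs further below:
print(should_withdraw(428947133581640,
                      [428947133460896, 428947133566072]))  # False
```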
This left patroni-08 as the best candidate for new leader. Logs from patroni-08:
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,081 INFO: Got response from patroni-05-db-gprd.c.gitlab-production.internal http://10.220.16.105:8009/patroni: b'{"database_system_identifier": "6343687859876602183", "role": "replica", "cluster_unlocked": true, "timeline": 31, "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "postmaster_start_time": "2019-10-29 09:21:50.461 GMT", "xlog": {"replayed_timestamp": "2019-12-12 07:31:25.054 GMT", "received_location": 428947133460896, "replayed_location": 428947133460896, "paused": false}, "state": "running", "server_version": 90614}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,081 INFO: Got response from patroni-06-db-gprd.c.gitlab-production.internal http://10.220.16.106:8009/patroni: b'{"state": "running", "database_system_identifier": "6343687859876602183", "timeline": 31, "role": "master", "replication": [{"state": "streaming", "client_addr": "10.220.16.107", "application_name": "patroni-07-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.109", "application_name": "patroni-09-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.111", "application_name": "patroni-11-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.112", "application_name": "patroni-12-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.108", "application_name": "patroni-08-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.103", "application_name": "patroni-03-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.104", "application_name": "patroni-04-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.102", "application_name": "patroni-02-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.105", "application_name": "patroni-05-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority":
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.110", "application_name": "patroni-10-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}, {"state": "streaming", "client_addr": "10.220.16.101", "application_name": "patroni-01-db-gprd.c.gitlab-production.internal", "usename": "gitlab-replicator", "sync_priority": 0, "sync_state": "async"}], "server_version": 90614, "postmaster_start_time": "2019-10-28 16:07:16.645 GMT", "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "xlog": {"location": 428947133581640}, "cluster_unlocked": false}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,084 INFO: Got response from patroni-10-db-gprd.c.gitlab-production.internal http://10.220.16.110:8009/patroni: b'{"postmaster_start_time": "2019-11-23 01:39:04.569 GMT", "cluster_unlocked": true, "timeline": 31, "xlog": {"replayed_timestamp": "2019-12-12 07:31:25.063 GMT", "replayed_location": 428947133566072, "paused": false, "received_location": 428947133566072}, "state": "running", "server_version": 90615, "role": "replica", "database_system_identifier": "6343687859876602183", "patroni": {"version": "1.6.0", "scope": "pg-ha-cluster"}}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,088 INFO: Got response from patroni-07-db-gprd.c.gitlab-production.internal http://10.220.16.107:8009/patroni: b'{"patroni": {"version": "1.6.0", "scope": "pg-ha-cluster"}, "server_version": 90615, "database_system_identifier": "6343687859876602183", "state": "running", "role": "replica", "cluster_unlocked": true, "timeline": 31, "postmaster_start_time": "2019-10-29 17:48:07.243 GMT", "xlog": {"received_location": 428947133581640, "paused": false, "replayed_location": 428947133581640, "replayed_timestamp": "2019-12-12 07:31:25.070 GMT"}}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,092 INFO: Got response from patroni-01-db-gprd.c.gitlab-production.internal http://10.220.16.101:8009/patroni: b'{"state": "running", "cluster_unlocked": true, "xlog": {"replayed_location": 428947133581640, "replayed_timestamp": "2019-12-12 07:31:25.070 GMT", "received_location": 428947133582952, "paused": false}, "database_system_identifier": "6343687859876602183", "server_version": 90615, "role": "replica", "postmaster_start_time": "2019-11-27 23:12:39.985 GMT", "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "timeline": 31}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,105 INFO: Got response from patroni-09-db-gprd.c.gitlab-production.internal http://10.220.16.109:8009/patroni: b'{"database_system_identifier": "6343687859876602183", "server_version": 90615, "timeline": 31, "state": "running", "cluster_unlocked": true, "xlog": {"paused": false, "replayed_location": 428947133581640, "received_location": 428947133581640, "replayed_timestamp": "2019-12-12 07:31:25.070 GMT"}, "postmaster_start_time": "2019-10-29 09:21:52.573 GMT", "role": "replica", "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,120 INFO: Got response from patroni-02-db-gprd.c.gitlab-production.internal http://10.220.16.102:8009/patroni: b'{"cluster_unlocked": true, "timeline": 31, "state": "running", "postmaster_start_time": "2019-10-29 09:21:50.651 GMT", "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "xlog": {"received_location": 428947133600792, "replayed_timestamp": "2019-12-12 07:31:25.079 GMT", "paused": false, "replayed_location": 428947133600792}, "database_system_identifier": "6343687859876602183", "server_version": 90614, "role": "replica"}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,187 INFO: Got response from patroni-04-db-gprd.c.gitlab-production.internal http://10.220.16.104:8009/patroni: b'{"role": "replica", "server_version": 90614, "database_system_identifier": "6343687859876602183", "cluster_unlocked": true, "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "postmaster_start_time": "2019-10-29 09:21:51.098 GMT", "timeline": 31, "state": "running", "xlog": {"replayed_timestamp": "2019-12-12 07:31:25.067 GMT", "replayed_location": 428947133580968, "paused": false, "received_location": 428947133580968}}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,190 INFO: Got response from patroni-03-db-gprd.c.gitlab-production.internal http://10.220.16.103:8009/patroni: b'{"patroni": {"version": "1.6.0", "scope": "pg-ha-cluster"}, "state": "running", "timeline": 31, "cluster_unlocked": true, "server_version": 90614, "role": "replica", "database_system_identifier": "6343687859876602183", "xlog": {"replayed_location": 428947133721400, "replayed_timestamp": "2019-12-12 07:31:25.123 GMT", "received_location": 428947133721400, "paused": false}, "postmaster_start_time": "2019-10-29 09:21:50.671 GMT"}'
2019-12-12_07:31:25 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:25,202 INFO: Got response from patroni-11-db-gprd.c.gitlab-production.internal http://10.220.16.111:8009/patroni: b'{"patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}, "postmaster_start_time": "2019-10-29 09:21:50.985 GMT", "cluster_unlocked": true, "timeline": 31, "xlog": {"replayed_location": 428947134134856, "replayed_timestamp": "2019-12-12 07:31:25.173 GMT", "paused": false, "received_location": 428947134134856}, "role": "replica", "database_system_identifier": "6343687859876602183", "server_version": 90615, "state": "running"}'
2019-12-12_07:31:26 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:26,088 INFO: Got response from patroni-12-db-gprd.c.gitlab-production.internal http://10.220.16.112:8009/patroni: b'{"role": "replica", "state": "running", "database_system_identifier": "6343687859876602183", "postmaster_start_time": "2019-10-29 09:21:52.046 GMT", "xlog": {"replayed_timestamp": "2019-12-12 07:31:26.067 GMT", "replayed_location": 428947138866080, "received_location": 428947138866080, "paused": false}, "timeline": 31, "server_version": 90615, "cluster_unlocked": true, "patroni": {"scope": "pg-ha-cluster", "version": "1.6.0"}}'
2019-12-12_07:31:26 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:26,153 WARNING: Master (patroni-06-db-gprd.c.gitlab-production.internal) is still alive
2019-12-12_07:31:26 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:26,173 INFO: following a different leader because i am not the healthiest node
2019-12-12_07:31:34 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:34,905 INFO: Lock owner: patroni-06-db-gprd.c.gitlab-production.internal; I am patroni-08-db-gprd.c.gitlab-production.internal
2019-12-12_07:31:34 patroni-08-db-gprd patroni[29808]: 2019-12-12 07:31:34,917 INFO: changing primary_conninfo and restarting in progress
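The decisive check is the WARNING at 07:31:26,153: before trying to take the leader lock, a candidate verifies whether the current lock owner is still responding. A rough sketch of that decision, illustrative only and not Patroni's code:

```python
def decide(leader_lock_owner, member_states):
    # If the lock owner still answers its health endpoint ("Master
    # ... is still alive"), every candidate backs off and keeps
    # following it; otherwise the healthiest node takes the lock.
    if member_states.get(leader_lock_owner) == "running":
        return "keep following the existing leader"
    return "attempt to acquire the leader lock"


print(decide("patroni-06-db-gprd.c.gitlab-production.internal",
             {"patroni-06-db-gprd.c.gitlab-production.internal": "running"}))
```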
So, since nothing was actually wrong with patroni-06 and it was still online, the leader role was preserved and the timeline remained 31:
+---------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
| Cluster | Member | Host | Role | State | TL | Lag in MB |
+---------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
| pg-ha-cluster | patroni-01-db-gprd.c.gitlab-production.internal | 10.220.16.101 | | running | 31 | 39 |
| pg-ha-cluster | patroni-02-db-gprd.c.gitlab-production.internal | 10.220.16.102 | | running | 31 | 43 |
| pg-ha-cluster | patroni-03-db-gprd.c.gitlab-production.internal | 10.220.16.103 | | running | 31 | 32 |
| pg-ha-cluster | patroni-04-db-gprd.c.gitlab-production.internal | 10.220.16.104 | | running | 31 | 39 |
| pg-ha-cluster | patroni-05-db-gprd.c.gitlab-production.internal | 10.220.16.105 | | running | 31 | |
| pg-ha-cluster | patroni-06-db-gprd.c.gitlab-production.internal | 10.220.16.106 | Leader | running | 31 | 0 |
| pg-ha-cluster | patroni-07-db-gprd.c.gitlab-production.internal | 10.220.16.107 | | running | 31 | 51 |
| pg-ha-cluster | patroni-08-db-gprd.c.gitlab-production.internal | 10.220.16.108 | | running | 31 | 46 |
| pg-ha-cluster | patroni-09-db-gprd.c.gitlab-production.internal | 10.220.16.109 | | running | 31 | 42 |
| pg-ha-cluster | patroni-10-db-gprd.c.gitlab-production.internal | 10.220.16.110 | | running | 31 | 31 |
| pg-ha-cluster | patroni-11-db-gprd.c.gitlab-production.internal | 10.220.16.111 | | running | 31 | 50 |
| pg-ha-cluster | patroni-12-db-gprd.c.gitlab-production.internal | 10.220.16.112 | | running | 31 | 34 |
+---------------+-------------------------------------------------+---------------+--------+---------+----+-----------+
Consequences
- ~1 minute of downtime on the replicas (caused by the restart that followed the primary_conninfo change)
Corrective actions
- Investigate using consul.checks: [] in the Patroni config to be more resilient to short network glitches (a configuration sketch follows).
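A sketch of what that could look like in the Patroni configuration. This assumes Patroni's consul.checks setting: with an empty list, the leader session is tied only to the TTL check rather than TTL plus serfHealth, so a one-second serf flap like the one above would not invalidate the session (host and register_service below are illustrative defaults):

```yaml
# patroni.yml (excerpt)
consul:
  host: 127.0.0.1:8500
  register_service: true
  # An empty checks list detaches the session from Consul's
  # serfHealth check, making brief membership flaps non-fatal.
  checks: []
```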