RCA for 2019-08-14 long degradation after postgres failover
I'm not convinced that the patroni failover was caused by a network blip.
Here are the patroni logs from the failover (I've removed tracebacks for brevity)
Highlights from the logs....
2019-08-14_08:24:57 all good
2019-08-14_08:25:07 all good
2019-08-14_08:25:17 all good
2019-08-14_08:25:27 all good
2019-08-14_08:25:37 all good
NOTHING LOGGED AT 08:25:47. This is either caused by a timeout or possibly a kernel lockup. We should check other logs for evidence of the latter.
2019-08-14_08:25:57 things are not good
2019-08-14_08:25:57 Read timeout 127.0.0.1:8500 (read timeout=3s) Unable to communicate with localhost!
2019-08-14_08:25:57 Unable to communicate with consul at https://localhost:8500/v1/kv/service/pg-ha-cluster/?recurse=1 Unable to communicate with localhost!
2019-08-14_08:25:57 'During handling of the above exception, another exception occurred: Exceeded retry deadline' - this is either a misconfiguration on our side, a bug in patroni, or further evidence of a kernel lockup (possible given the missing logs at 08:25:47).
2019-08-14_08:27:20 postmaster restart
2019-08-14_08:27:23 -> 2019-08-14_08:27:30 no response connecting to localhost:5432 (postgres service)
2019-08-14_08:27:32 localhost:5432 - accepting connections. Weird sequence: no response, rejecting, accepting. If the service had only just started, we would expect just rejecting, then accepting.
2019-08-14_08:27:36 2019-08-14 08:27:36,285 ERROR: get_postgresql_status timeout expired. Even after the restart (16 seconds earlier) and accepting connections, the server is still not responding?
Summary
Unable to communicate with consul on localhost
Unable to communicate with postgres on localhost
"No response" from postgres on localhost
After accepting connections, postgres remained unresponsive
These events don't seem like a cloud network outage to me. Had it been a cloud provider event, communication with localhost would have been available.
Another question: why is the regular polling log entry missing immediately before the incident?
Full logs
2019-08-14_08:24:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:24:57,120 INFO: no action. i am the leader with the lock
2019-08-14_08:25:07 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:07,098 INFO: Lock owner: patroni-01-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:25:07 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:07,123 INFO: no action. i am the leader with the lock
2019-08-14_08:25:17 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:17,097 INFO: Lock owner: patroni-01-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:25:17 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:17,120 INFO: no action. i am the leader with the lock
2019-08-14_08:25:27 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:27,096 INFO: Lock owner: patroni-01-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:25:27 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:27,119 INFO: no action. i am the leader with the lock
2019-08-14_08:25:37 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:37,096 INFO: Lock owner: patroni-01-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:25:37 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:37,119 INFO: no action. i am the leader with the lock
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:57,819 ERROR: get_cluster
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: socket.timeout: timed out
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335)
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/kv/service/pg-ha-cluster/?recurse=1 (Caused by ReadTimeoutError("HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335)",))
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2019-08-14_08:25:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:25:57,820 ERROR: Error communicating with DCS
2019-08-14_08:27:16 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:16 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:16 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:16 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:16,172 INFO: closed patroni connection to the postgresql cluster
2019-08-14_08:27:20 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:20,375 INFO: postmaster pid=49178
2019-08-14_08:27:23 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:24 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:25 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:26 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:27 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:28 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:29 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:30 patroni-01-db-gprd patroni[17041]: localhost:5432 - no response
2019-08-14_08:27:32 patroni-01-db-gprd patroni[17041]: localhost:5432 - rejecting connections
2019-08-14_08:27:32 patroni-01-db-gprd patroni[17041]: localhost:5432 - accepting connections
2019-08-14_08:27:32 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:32,807 INFO: demoted self because DCS is not accessible and i was a leader
2019-08-14_08:27:32 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:32,809 WARNING: Loop time exceeded, rescheduling immediately.
2019-08-14_08:27:32 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:32,814 INFO: establishing a new patroni connection to the postgres cluster
2019-08-14_08:27:36 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:36,285 ERROR: get_postgresql_status
2019-08-14_08:27:36 patroni-01-db-gprd patroni[17041]: psycopg2.OperationalError: timeout expired
2019-08-14_08:27:36 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:27:36 patroni-01-db-gprd patroni[17041]: patroni.exceptions.PostgresConnectionException: 'connection problems'
2019-08-14_08:27:39 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:39,609 ERROR: get_postgresql_status
2019-08-14_08:27:39 patroni-01-db-gprd patroni[17041]: psycopg2.OperationalError: timeout expired
2019-08-14_08:27:39 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:27:39 patroni-01-db-gprd patroni[17041]: patroni.exceptions.PostgresConnectionException: 'connection problems'
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: BrokenPipeError: [Errno 32] Broken pipe
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: socket.timeout: timed out
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: urllib3.exceptions.ReadTimeoutError: HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335)
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=8500): Max retries exceeded with url: /v1/session/create (Caused by ReadTimeoutError("HTTPConnectionPool(host='127.0.0.1', port=8500): Read timed out. (read timeout=3.3333333333333335)",))
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: During handling of the above exception, another exception occurred:
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: patroni.utils.RetryFailedError: 'Exceeded retry deadline'
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:51,288 ERROR: Error communicating with DCS
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:51,288 INFO: DCS is not accessible
2019-08-14_08:27:51 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:51,289 WARNING: Loop time exceeded, rescheduling immediately.
2019-08-14_08:27:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:57,566 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:27:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:57,567 INFO: does not have lock
2019-08-14_08:27:57 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:57,582 INFO: Local timeline=23 lsn=128E2/A9000098
2019-08-14_08:27:58 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:57,630 ERROR: Exception when working with master via replication connection
2019-08-14_08:27:58 patroni-01-db-gprd patroni[17041]: psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-08-14_08:27:58 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:58,916 INFO: no action. i am a secondary and i am following a leader
2019-08-14_08:27:59 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:27:59,705 INFO: postmaster pid=49430
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,293 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,304 INFO: changing primary_conninfo and restarting in progress
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: localhost:5432 - accepting connections
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: localhost:5432 - accepting connections
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,488 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,488 INFO: does not have lock
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,488 INFO: establishing a new patroni connection to the postgres cluster
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,504 INFO: Local timeline=23 lsn=128E2/A9000098
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,550 ERROR: Exception when working with master via replication connection
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: psycopg2.OperationalError: FATAL: remaining connection slots are reserved for non-replication superuser connections
2019-08-14_08:28:01 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:01,600 INFO: no action. i am a secondary and i am following a leader
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,757 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,757 INFO: does not have lock
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,774 INFO: Local timeline=23 lsn=128E2/A9000098
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,826 INFO: master_timeline=24
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,826 INFO: master: history=1#011D2/AE3F3440#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2#0112024/9000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 3#0113A75/90000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 4#0115C10/3000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 5#0117C6A/7000000#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 6#0119EDE/64000100#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 7#011A666/C775F138#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 8#011A87D/72000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 9#011B948/82000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 10#011FCF1/D708268#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 11#011FCF2/9F2B00C0#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 12#011FCF4/8000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 13#011FCF4/13000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 14#011FCF4/32000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 15#011FCF5/1A000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 16#011FCF5/90000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 17#011FCF5/99000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 18#011FCF5/C4000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 19#011FCF6/99000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 20#01111429/35000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 21#01111708/2A632E88#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 22#01112437/95000098#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 23#011128E2/88F0B1A8#011no recovery target specified
2019-08-14_08:28:11 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:11,840 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:21 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:21,488 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:21 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:21,499 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal in progress
2019-08-14_08:28:27 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:27,958 INFO: running pg_rewind from user=gitlab-superuser host=10.220.16.104 port=5432 dbname=postgres sslmode=prefer sslcompression=1
2019-08-14_08:28:28 patroni-01-db-gprd patroni[17041]: servers diverged at WAL position 128E2/88F0B1A8 on timeline 23
2019-08-14_08:28:28 patroni-01-db-gprd patroni[17041]: rewinding from last common checkpoint at 128E2/61CFD690 on timeline 23
2019-08-14_08:28:31 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:31,489 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:31 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:31,492 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal in progress
2019-08-14_08:28:41 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:41,489 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:41 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:41,489 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal in progress
2019-08-14_08:28:51 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:51,488 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:51 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:51,491 INFO: running pg_rewind from patroni-04-db-gprd.c.gitlab-production.internal in progress
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: unexpected result while fetching remote files: ERROR: canceling statement due to statement timeout
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: Failure, exiting
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:53,378 ERROR: Failed to rewind from healty master: patroni-04-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:53,383 WARNING: Postgresql is not running.
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:53,383 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
2019-08-14_08:28:53 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:28:53,399 INFO: doing crash recovery in a single user mode
2019-08-14_08:29:03 patroni-01-db-gprd patroni[17041]: 2019-08-14 08:29:03,384 INFO: Lock owner: patroni-04-db-gprd.c.gitlab-production.internal; I am patroni-01-db-gprd.c.gitlab-production.internal
These lines are logged when another consul agent suspects the node has failed and pings it; the node can respond by "refuting" the suspected outage, and that refutation is what gets logged.
Pointing towards a network outage:
We had 4 refute events in a short time (over the course of the whole day we had 30 in total). This means 4 other nodes suspected this machine of having failed.
Pointing away from a network outage:
All 4 refute messages were received (over the network!) during the time that connections to localhost were failing.
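For reference, a count like the one above can be pulled straight from the consul agent output. This is only a sketch: the log location and the exact message text ("Refuting a suspect message" is what memberlist-based agents typically log) are assumptions, since the actual lines are not quoted here.
# count today's refute events in the consul agent log (path is an assumption)
grep -c "Refuting a suspect message" /var/log/syslog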
Additional data point: the kernel logs on patroni-01 report 3 "page allocation failure" events about 90 minutes before everything went bad, at 06:54:46, 07:00:52, and 07:05:07. Possibly coincidental; 90 minutes is a long time, but it hints at memory pressure (not consistent with the graphs), or perhaps something weird in the kernel's memory handling.
The stack traces don't explicitly mention localhost vs off-host, although I suppose that given most mention IRQ handling and virtnet_poll that they're probably not localhost related. One was in just TCP handling though:
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624041] __alloc_pages_slowpath+0x1c1/0xb20
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624043] ? ip_finish_output+0x176/0x270
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624044] ? lock_timer_base+0x7d/0xa0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624045] __alloc_pages_nodemask+0x21a/0x2a0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624046] alloc_pages_current+0x95/0x140
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624048] skb_page_frag_refill+0xcc/0xf0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624049] sk_page_frag_refill+0x21/0x80
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624051] tcp_sendmsg+0x22c/0xc20
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624053] inet_sendmsg+0x31/0xb0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624054] sock_sendmsg+0x38/0x50
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624056] SYSC_sendto+0x101/0x190
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624057] ? syscall_trace_enter+0x1d9/0x2f0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624059] ? __audit_syscall_exit+0x230/0x2c0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624061] SyS_sendto+0xe/0x10
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624062] do_syscall_64+0x5b/0xc0
Aug 13 07:00:52 patroni-01-db-gprd kernel: [18806711.624063] entry_SYSCALL64_slow_path+0x25/0x25
Which suggests the kernel was having difficulty managing its internal memory, and it was affecting the TCP stack. But again, 90 minutes before, so other than hinting, it's not directly correlated.
Now I see.. tcp_sendmsg.. That is interesting and concerning, but I agree 90 minutes between that and the failure is a very long time on a busy server :-/
Is there any good reason that a straight INSERT would take 5 seconds? This probably indicates that things were bad on the server before the patroni logs indicated as much.
We discussed this internally with @ahachete and one of our observations was: the time needed to download the base backup and restore it downstream is long enough that the WAL it requires falls outside the leader's WAL retention, causing Postgres to retrieve segments through restore_command (***). The restore_command executes wal-e once per WAL segment, leading to an unnecessary back and forth between GCS and the restoring node. Once recovery starts to fetch WAL that is still kept on the cluster, it speeds up, as restore_command is no longer invoked and the stream is managed entirely by Postgres backends.
The restore_command is configured through Patroni (*) and currently calls wal-e once per requested WAL segment. A possible workaround would be to wrap this tool in a script that downloads the archived objects in parallel (**).
Pseudocode (don't want to extend work on this in this thread, but it would be nice to check the idea before materializing it into issues):
foreach ${3}
    /usr/bin/envdir ${2} /opt/wal-e/bin/wal-e wal-fetch ${3} ${4} &
wait files to be 16MB || return error
return 0
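To make the idea slightly more concrete, here is a rough shell sketch of such a wrapper. It is illustrative only: the script name, argument order and destination layout are assumptions, and only the envdir/wal-e wal-fetch invocation mirrors what the current restore_command does.
#!/usr/bin/env bash
# Hypothetical wrapper sketch: fetch several WAL segments in parallel with
# wal-e, then verify each one is a complete 16 MiB segment.
# Assumed usage: wal-fetch-parallel.sh <dest_dir> <envdir> <segment>...
set -euo pipefail
dest_dir=$1; envdir=$2; shift 2
pids=()
for segment in "$@"; do
  /usr/bin/envdir "$envdir" /opt/wal-e/bin/wal-e wal-fetch "$segment" "$dest_dir/$segment" &
  pids+=($!)
done
# Wait for every fetch; fail the whole restore if any of them failed.
for pid in "${pids[@]}"; do
  wait "$pid" || exit 1
done
# A WAL segment must be exactly 16777216 bytes; anything else is partial/corrupt.
for segment in "$@"; do
  [ "$(stat -c%s "$dest_dir/$segment")" -eq 16777216 ] || exit 1
done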
(***)
Right now, that retention is barely 30 minutes, leaving two choices: 1) increase wal_keep_segments (but this is a short-term workaround); 2) set up an intermediate node to serve as a WAL archiver (it could be the DR node or similar, and the wrapper script could verify the availability of this node). I am not sure how 2) could affect product integration, though, so it may not be the best solution.
After the failover, why did one read replica (06) have a disproportionately large number of client connections compared to the others, causing performance degradation until we were able to restore more replicas?
Thank you all for your reports and investigation. Very insightful. Here is what we observed and acted on, along with some ideas and recommendations.
State of old primary after the failover
pg_rewind did indeed fail, due to the current statement_timeout setting. It is the master node, to which pg_rewind connects, that triggers the timeout, not the rewinding node. It should be changed on every node, as the configuration needs to be consistent across all of them. There are two options to do this without affecting all connections:
alter user "gitlab-superuser" set statement_timeout=0 or equivalent command, to disable the statement timeout for the gitlab-superuser user --which would not be a bad thing anyway, unless any application connects with this user, but hopefully is not the case.
Configure/patch needed? for patroni to pass this as a configuration option (-c) to the source connection string of the pg_rewind command.
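For illustration, the per-connection variant could look roughly like the following (host and paths copied from the manual pg_rewind run later in this thread; the options='-c statement_timeout=0' part is the suggestion, not something currently configured):
# Sketch only: disable statement_timeout just for the rewind source connection
# via a libpq "options" parameter in the source-server string.
/usr/lib/postgresql/9.6/bin/pg_rewind \
  -D /var/opt/gitlab/postgresql/data/ \
  --source-server="host=patroni-04-db-gprd.c.gitlab-production.internal user=gitlab-superuser dbname=postgres options='-c statement_timeout=0'"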
But in this case, probably 01 would not have been brought online anyway. There were serious issues with it.
2019-08-14 08:59:35.508 GMT [59997, 0]: [1-1] user=postgres_exporter,db=gitlabhq_production,app=[unknown],client=127.0.0.1 FATAL: the database system is starting up
2019-08-14 08:59:35.569 GMT [59981, 0]: [2-1] user=,db=,app=,client= LOG: entering standby mode
2019-08-14 08:59:35.569 GMT [59981, 0]: [3-1] user=,db=,app=,client= LOG: could not read from log segment 00000018000128E300000010, offset 0: No such file or directory
2019-08-14 08:59:35.570 GMT [59981, 0]: [4-1] user=,db=,app=,client= LOG: invalid primary checkpoint record
2019-08-14 08:59:35.570 GMT [59981, 0]: [5-1] user=,db=,app=,client= LOG: could not read from log segment 00000018000128E30000000E, offset 0: Success
2019-08-14 08:59:35.570 GMT [59981, 0]: [6-1] user=,db=,app=,client= LOG: invalid secondary checkpoint record
2019-08-14 08:59:35.570 GMT [59981, 0]: [7-1] user=,db=,app=,client= PANIC: could not locate a valid checkpoint record
Checking the integrity of the two referenced WAL files revealed that both were empty:
$ ls -la 00000018000128E300000010
-rw------- 1 gitlab-psql gitlab-psql 0 Aug 14 08:28 00000018000128E300000010
$ ls -la 00000018000128E30000000E
-rw------- 1 gitlab-psql gitlab-psql 0 Aug 14 08:28 00000018000128E30000000E
They were not the only ones:
$ find /var/opt/gitlab/postgresql/20190814-failed_pg_xlog_files -type f -not -size 16M | wc -l
2045
These files were put into quarantine in the above directory. Even worse, another WAL file, 00000017000128DB00000069, was found to have a size of exactly 12000000 bytes (it should be 16777216), which is obviously corrupted: PostgreSQL always operates on multiples of the page size, which is by default (and in this case) 8 KB, so a WAL file can only be a multiple of that size, which 12000000 is not. On a possibly related note, a defunct lzo process was identified on the server. One hypothesis is that a network hiccup interrupted wal-e's fetch process, leaving corrupted WAL files behind; that would be consistent with the defunct lzo process, which wal-e spawns to decompress the received WAL files. 12000000 is a suspicious number, which may have resulted from the lzo decompression, which surely operates on blocks of a fixed size. It would also mean that wal-e does not work completely atomically, which would be a bad thing.
Recovering the old primary
The WAL files with incorrect sizes were moved into a quarantine directory. A pg_rewind was run:
$ /usr/lib/postgresql/9.6/bin/pg_rewind -D /var/opt/gitlab/postgresql/data/ -n --source-server="host=patroni-04-db-gprd.c.gitlab-production.internal password=<redacted> user=gitlab-superuser dbname=postgres"
source and target cluster are on the same timeline
no rewind required
but decided to take no action, claiming that 01 was already on the correct timeline. This may have been an effect of the partial execution (before being killed by the statement timeout) of the initial pg_rewind carried out by Patroni when re-configuring the server as a replica. At this point, Patroni was trying to restart the database continuously.
Both the patroni and pgbouncer services were stopped on the server. A manual recovery strategy for 01 was planned, involving starting postgresql without the intervention of Patroni, to force a recovery from the WAL archives (via wal-e). This proved more challenging than expected, and may be submitted as feedback to Patroni: because Patroni manages the PostgreSQL configuration, there is essentially no configuration available if you try to boot the server "standalone".
A basic configuration was created for both postgresql.conf and recovery.conf. The relevant fact is that it was a configuration to allow the server to boot and the recovery process to fetch from the WAL archive. Local files were backed up to be restored to the expected Patroni state afterwards.
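As a rough illustration of what that minimal configuration amounted to (the envdir path below is a placeholder; the real files were backed up and later restored to the Patroni-managed state):
# Sketch of a minimal recovery.conf for standalone archive recovery via wal-e.
cat > /var/opt/gitlab/postgresql/data/recovery.conf <<'EOF'
standby_mode = 'on'
restore_command = '/usr/bin/envdir /etc/wal-e.d/env /opt/wal-e/bin/wal-e wal-fetch "%f" "%p"'
EOF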
This strategy worked. PostgreSQL was able to start and enter recovery mode, fetching files from the archive via wal-e. The process is slow, however, and could be improved: 01 was applying WAL, but the master was also generating WAL at a high pace, so catch-up was slow and took approximately one hour.
An unused slot on the master (created by the process that bootstrapped the 07 server) was leveraged to switch the recovery process manually to streaming replication. Once the latest checkpoint's REDO WAL file (as reported by pg_controldata) was at a later position than the slot, PostgreSQL was stopped and Patroni was started. Patroni was then able to start and automatically continue the recovery process via streaming replication. This was orders of magnitude faster than archive recovery, and 01 caught up with the master in a few minutes.
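The checkpoint-versus-slot comparison described above can be eyeballed with something like the following (host name taken from earlier in the thread; the slot itself is identified by inspecting pg_replication_slots):
# On the recovering node: the latest checkpoint's REDO location / REDO WAL file.
/usr/lib/postgresql/9.6/bin/pg_controldata /var/opt/gitlab/postgresql/data/ | grep "REDO"
# On the master: how far back each slot still retains WAL.
psql -h patroni-04-db-gprd.c.gitlab-production.internal -U gitlab-superuser -d postgres \
  -c "SELECT slot_name, restart_lsn FROM pg_replication_slots;"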
At this point 01 was online, synced and consul maintenance was stopped. 01 started receiving traffic and alleviated the load on the other servers, most notably 06.
Uneven load on the read replicas
We decided to monitor the CPU usage of the core that pgbouncer uses on every node of the cluster. Since all connections go through the local pgbouncers, and pgbouncer is single-threaded (it can only use one core), this should be a good proxy for the load each pgbouncer is receiving. AFAIK there are no metrics in Prometheus for monitoring this.
A script based on this snippet (but automated through SSH to run on all hosts) was used:
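The snippet itself is not reproduced here; as a rough idea of the approach (assuming sysstat's pidstat is available on the hosts):
# Sample the pgbouncer process every 15 seconds; the %CPU and CPU columns show
# how close the single core it runs on is to saturation.
PGB_PID=$(pgrep -o -x pgbouncer)
while true; do
  pidstat -p "$PGB_PID" 15 1 | grep pgbouncer
done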
CPU usage was queried every 15 seconds and is graphed below:
This demonstrates that the per-core CPU load is uneven; 06 is the most loaded. CPU load measured this way has been monitored at intervals since, and at the time of this writing it is at lower levels.
However, the relevant fact is that, before this data was recorded, the load was monitored informally on 06 before and after 01 was recovered, and it was very high: over 99%. Since pgbouncer is single-threaded, once it maxes out its CPU core it enters a state where it cannot process all incoming connections due to CPU contention, and the resulting queuing drives Waiting Client Connections to spike to over 1K. When 01 was added back, CPU load on 06 dropped below 98%, at which point the number of waiting connections also started to fall.
In other words: Waiting Client Connections start to appear when CPU load on the pgbouncer process gets close to 100%, and then pgbouncer suddenly becomes a very serious performance bottleneck. This needs to be addressed with high priority, as CPU load on the current pgbouncers is not particularly low. This leads me to believe that unplanned spikes may result in pgbouncer saturation and are probably at the heart of several of the performance degradations observed in the past, too.
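For completeness, the queueing itself can also be observed directly from pgbouncer's admin console (port and credentials below are assumptions for illustration); the cl_waiting column corresponds to the Waiting Client Connections mentioned above:
# List per-pool client counts; a persistently non-zero cl_waiting means clients
# are queueing behind the saturated pgbouncer.
psql -h 127.0.0.1 -p 6432 -U pgbouncer -d pgbouncer -c "SHOW POOLS;"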
Ideas on why the failover happened
I don't have conclusive information yet, and there are very useful comments on this thread. I suggest two main factors, which might be inter-related:
Network hiccup. Patroni failed to renew the leader lock at 2019-08-14_08:25:57 and started the (initial, failed) recovery at 2019-08-14_08:28:01. All the 0-sized, corrupted WAL files were created at Aug 14 08:28. If the network was not working correctly at this time, it could explain both sets of errors.
Server issues (probably from the underlying hardware or virtualization mechanism). As has been mentioned here, kernel logs of 01 have several kernel stack traces and even (past) disk access failures:
[Mon Jul 8 06:09:16 2019] EXT4-fs (sdc): error count since last fsck: 14
[Mon Jul 8 06:09:16 2019] EXT4-fs (sdc): initial error at time 1559062885: ext4_reserve_inode_write:5537
[Mon Jul 8 06:09:16 2019] EXT4-fs (sdc): last error at time 1559111108: ext4_discard_preallocations:4046
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_wait_block_bitmap:503: comm svlogd: Cannot read block bitmap - block_group = 88, block_bitmap = 2621448
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_wait_block_bitmap:503: comm svlogd: Cannot read block bitmap - block_group = 387, block_bitmap = 12582915
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_discard_preallocations:4038: comm svlogd: Error loading buddy information for 387
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_discard_preallocations:4038: comm svlogd: Error loading buddy information for 88
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_wait_block_bitmap:503: comm svlogd: Cannot read block bitmap - block_group = 88, block_bitmap = 2621448
[Mon Jul 8 22:55:29 2019] EXT4-fs error (device sdc): ext4_discard_preallocations:4038: comm svlogd: Error loading buddy information for 88
[Mon Jul 8 22:55:29 2019] VFS: Dirty inode writeback failed for block device sdc (err=-5).
Admittedly, the failures above are from the past and refer to the PostgreSQL logs disk (sdc), but they nonetheless reveal that some disk component was not working correctly, and could have been related to the network issues.
There might be a time correlation. It could even be that the spike in the duration of the (contending) queries is a symptom, but I highly doubt it is a cause. How can it be that a node fails (i.e., cannot re-acquire the leader lock) under higher load during a set of contending queries? PostgreSQL is much more reliable than that, and should continue operating even in the worst scenarios of ill-crafted queries and high load. Queries may experience bad latencies, but that shouldn't prevent the software as a whole from running properly and (in the absence of absolute CPU overload, which wasn't the case) responding in a timely manner to the lock renewals.
A different story would be that, if the disk is failing (due to, for instance, a network hiccup), queries may experience longer latencies, made even worse if they are contending for a lock, in which case they can easily and quickly pile up. That's why I consider this more a symptom than a cause.
The following slowlog entry reveals a 5-second INSERT. This is hardly explained by anything other than a disk/network issue:
and this happened shortly before the node failed to renew the leader lock.
Recommendations and action items
Remove the statement timeout for the gitlab-superuser user, or have Patroni disable it for the connection used by the pg_rewind command.
Monitor the CPU load of the pgbouncer core, and alert if it gets close to 100%.
Scale pgbouncers. If they saturate, performance degrades, and this might happen under spikes of different natures. We may start by making sure pgbouncer is pinned to a dedicated OS core (a pinning sketch follows after this list). Then we might analyze whether to have more than one pgbouncer, with a load balancer like haproxy to distribute load among them; to use external pgbouncers (on separate nodes) on machines with the fastest cores available (in terms of GHz), even if they only have 2 or 4 cores; or to consider a multi-threaded connection pooler like Odyssey.
We might want to consider having every node send/receive a continuous stream of low-bandwidth traffic, and monitor it. Something like a permanently running iptraf (a crude probe sketch also follows after this list). This cannot be at 1-minute resolution, like some metrics already available in Prometheus. With this, we should be able to see clearly whether there were network hiccups.
Further investigate and/or test to see if we can reproduce the corruption of the WAL files, and determine whether it was because the pg_rewind process was killed by the statement timeout, or because the network hiccup made wal-e fail non-atomically. If the latter, it would require significant thought on the use of wal-e or on the need to fix it. It cannot be non-atomic; if it is, it would be responsible for preventing the automatic failover from working. This relates to https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7559
As mentioned already, continue the investigation to understand why the load is, apparently, not evenly distributed across the read replicas. Investigate the application load balancing algorithm and code to understand if it is a cause.
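Two quick sketches for the pgbouncer items above, both illustrative rather than prescriptive. First, pinning the running pgbouncer process to a dedicated core (core number arbitrary; a real change would go through the service's unit/runit configuration and would also keep other work off that core):
# Pin pgbouncer to one specific core (core 3 here) so it always runs on the
# same CPU; isolating that core from other tasks is a separate step.
taskset -cp 3 "$(pgrep -o -x pgbouncer)"
Second, a very crude continuous network probe between nodes (target host and log path are placeholders):
# One ICMP echo per second to a peer node, with kernel timestamps (-D, Linux
# iputils), giving sub-minute visibility into network hiccups.
ping -i 1 -D patroni-04-db-gprd.c.gitlab-production.internal >> /var/log/net-probe-patroni-04.log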
Without reading all of the above, I think what might have happened is the following:
We have three secondaries: A, B, C. All secondaries handle an even amount of queries. Host A has to handle a lot of slow-running queries, e.g. because somebody is crawling all projects using the API. As long as all hosts are up, A is sort of fine. C suddenly becomes unavailable, so now A and B each receive half of C's load. Because A is still handling a lot of slow queries, this may result in new queries having to wait for a connection to become available, which can then result in a lot of waiting clients. When C comes back online, the load spreads evenly across all hosts again. A might still be running slow queries, but at least C is now available to handle the other queries.
This would explain the graphs we saw, why it happened at least twice, and it matches the behaviour you'd expect to see from a user using a lot of (very) slow API calls.
From a load balancer perspective, we would have to look at additional statistics to determine if a host is available for queries. In the past we considered looking at CPU usage, but it was determined that this would be too difficult to obtain using a Rails process. Alternatively we might look at some other PG statistics to determine if a host is healthy enough for queries.
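As one example of the kind of PG statistic the load balancer could consult (illustrative only, not the current implementation), replication replay lag is cheap to query on each replica; a host whose lag crosses a threshold could be temporarily excluded from the read pool:
# On a replica: how far behind is WAL replay right now? (host is a placeholder)
psql -h patroni-06-db-gprd.c.gitlab-production.internal -U gitlab-superuser -d gitlabhq_production \
  -c "SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag;"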
Failover was initiated due to a Patroni timeout during its get_cluster call to the local Consul agent.
After failover, pg_rewind failed to convert the former primary node into a replica, due to statement_timeout.
In both cases, log events preceding the failover lead us to suspect that either a network or storage problem could have caused these timeouts, leading to the failover event.