postgres-02 became master during GCP switchover
Reconfigure logs:
[2018-08-11T12:06:17+00:00] INFO: template[/var/opt/gitlab/postgresql/data/postgresql.conf] updated file contents /var/opt/gitlab/postgresql/data/postgresql.conf
[2018-08-11T12:06:17+00:00] INFO: template[/var/opt/gitlab/postgresql/data/postgresql.conf] sending run action to execute[reload postgresql] (immediate)
[2018-08-11T12:06:18+00:00] INFO: execute[reload postgresql] ran successfully
[2018-08-11T12:06:18+00:00] INFO: template[/var/opt/gitlab/postgresql/data/postgresql.conf] sending run action to execute[start postgresql] (immediate)
[2018-08-11T12:06:18+00:00] INFO: template[/var/opt/gitlab/postgresql/data/runtime.conf] backed up to /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/postgresql/data/runtime.conf.chef-201808111206
18.185695
[2018-08-11T12:06:18+00:00] INFO: template[/var/opt/gitlab/postgresql/data/runtime.conf] updated file contents /var/opt/gitlab/postgresql/data/runtime.conf
[2018-08-11T12:06:18+00:00] INFO: template[/var/opt/gitlab/postgresql/data/runtime.conf] sending run action to execute[reload postgresql] (immediate)
[2018-08-11T12:06:18+00:00] INFO: execute[reload postgresql] ran successfully
[2018-08-11T12:06:18+00:00] INFO: template[/var/opt/gitlab/postgresql/data/runtime.conf] sending run action to execute[start postgresql] (immediate)
[2018-08-11T12:06:19+00:00] INFO: file[/var/opt/gitlab/gitlab-rails/etc/database_geo.yml] backed up to /opt/gitlab/embedded/cookbooks/cache/backup/var/opt/gitlab/gitlab-rails/etc/database_geo.yml.chef-201808
11120619.295671
[2018-08-11T12:06:19+00:00] INFO: file[/var/opt/gitlab/gitlab-rails/etc/database_geo.yml] deleted file at /var/opt/gitlab/gitlab-rails/etc/database_geo.yml
[2018-08-11T12:06:19+00:00] INFO: templatesymlink[Create a gitlab.yml and create a symlink to Rails root] sending run action to execute[clear the gitlab-rails cache] (delayed)
[2018-08-11T12:06:19+00:00] INFO: Chef Run complete in 16.751320464 seconds
[2018-08-11T12:06:19+00:00] INFO: Running report handlers
[2018-08-11T12:06:19+00:00] INFO: Report handlers complete
Notice 10 minutes after the failover, we see that postgres-02 was promoted at 2018-08-11_12:19:51.75213:
PostgreSQL logs:
2018-08-11_12:18:50.22567 2018-08-11 12:18:50 GMT [34355]: [59-1] LOG: recovery restart point at 7C6A/D201E0D8
2018-08-11_12:18:50.22567 2018-08-11 12:18:50 GMT [34355]: [60-1] DETAIL: last completed transaction was at log time 2018-08-11 12:18:45.865363+00
2018-08-11_12:19:51.75213 2018-08-11 12:19:51 GMT [34351]: [7-1] LOG: received promote request
2018-08-11_12:19:51.75220 2018-08-11 12:19:51 GMT [34358]: [4-1] FATAL: terminating walreceiver process due to administrator command
2018-08-11_12:19:51.75285 2018-08-11 12:19:51 GMT [34351]: [8-1] LOG: invalid resource manager ID 48 at 7C6A/D83DF510
2018-08-11_12:19:51.75288 2018-08-11 12:19:51 GMT [34351]: [9-1] LOG: redo done at 7C6A/D83DF4E8
2018-08-11_12:19:51.75289 2018-08-11 12:19:51 GMT [34351]: [10-1] LOG: last completed transaction was at log time 2018-08-11 12:19:51.745188+00
2018-08-11_12:19:51.76387 2018-08-11 12:19:51 GMT [34351]: [11-1] LOG: selected new timeline ID: 7
2018-08-11_12:19:51.81128 2018-08-11 12:19:51 GMT [34351]: [12-1] LOG: archive recovery complete
2018-08-11_12:19:51.92634 2018-08-11 12:19:51 GMT [34351]: [13-1] LOG: MultiXact member wraparound protections are now enabled
2018-08-11_12:19:51.92655 2018-08-11 12:19:51 GMT [34355]: [61-1] LOG: checkpoint starting: force
2018-08-11_12:19:52.06219 2018-08-11 12:19:52 GMT [34326]: [5-1] LOG: database system is ready to accept connections
2018-08-11_12:19:52.06224 2018-08-11 12:19:52 GMT [30886]: [1-1] LOG: autovacuum launcher started
2018-08-11_12:19:53.60137 wal_e.worker.upload INFO MSG: begin archiving a file
2018-08-11_12:19:53.60141 DETAIL: Uploading "pg_xlog/0000000600007C6A000000D8.partial" to "s3://gitlab-dbprod-backups/postgres-02/wal_005/0000000600007C6A000000D8.partial.lzo".
2018-08-11_12:19:53.60143 STRUCTURED: time=2018-08-11T12:19:53.600980-00 pid=30889 action=push-wal key=s3://gitlab-dbprod-backups/postgres-02/wal_005/0000000600007C6A000000D8.partial.lzo prefix=postg
res-02/ seg=0000000600007C6A000000D8.partial state=begin
2018-08-11_12:19:54.62926 wal_e.worker.upload INFO MSG: completed archiving to a file
2018-08-11_12:19:54.62931 DETAIL: Archiving to "s3://gitlab-dbprod-backups/postgres-02/wal_005/0000000600007C6A000000D8.partial.lzo" complete at 11204.3KiB/s.
2018-08-11_12:19:54.62932 STRUCTURED: time=2018-08-11T12:19:54.628632-00 pid=30889 action=push-wal key=s3:
In the Consul logs around this time, we see a single entry that suggests Consul thinks a failed master occurred:
2018-08-11_11:03:10.61107 2018/08/11 11:03:10 [INFO] serf: EventMemberJoin: pgbouncer-01-db-gprd.c.gitlab-production.internal 10.217.4.2
2018-08-11_12:19:56.84713 2018/08/11 12:19:56 [ERR] agent: Failed to invoke watch handler '/opt/gitlab/bin/gitlab-ctl consul watchers handle-failed-master': exit statu
s 1
2018-08-11_12:20:16.60256 2018/08/11 12:20:16 [INFO] agent: Synced check 'service:postgresql'
2018-08-11_12:20:50.33579 2018/08/11 12:20:50 [INFO] agent: Synced check 'service:postgresql'
2018-08-11_12:22:31.32798 2018/08/11 12:22:31 [INFO] agent: Synced check 'service:postgresql'
2018-08-11_12:22:42.55837 2018/08/11 12:22:42 [INFO] agent: Synced check 'service:postgresql'
2018-08-11_12:22:53.76409 2018/08/11 12:22:53 [INFO] agent: Synced check 'service:postgresql'
2018-08-11_12:23:27.44530 2018/08/11 12:23:27 [INFO] agent: Synced check 'service:postgresql'
repmgrd logs on postgres-02 reflects this promotion:
2018-08-11_12:18:36.10089 ERROR: this cluster already has an active master server
2018-08-11_12:18:36.10175 [2018-08-11 12:18:36] [DEBUG] connecting to: 'host=postgres-02-db-gprd.c.gitlab-production.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr sslmode=prefer sslcompression=0
fallback_application_name='repmgr''
2018-08-11_12:18:36.11797 [2018-08-11 12:18:36] [INFO] retrieving node list for cluster 'gitlab_cluster'
2018-08-11_12:18:36.11939 [2018-08-11 12:18:36] [DEBUG] connecting to: 'host=postgres-03-db-gprd.c.gitlab-production.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr sslmode=prefer sslcompression=0
fallback_application_name='repmgr''
2018-08-11_12:18:36.14036 [2018-08-11 12:18:36] [DEBUG] get_master_connection(): current master node is 842347107
2018-08-11_12:18:36.14036 [2018-08-11 12:18:36] [NOTICE] Original master reappeared before this standby was promoted - no action taken
2018-08-11_12:18:38.14074 [2018-08-11 12:18:38] [DEBUG] standby check loop will terminate
2018-08-11_12:18:38.14075 [2018-08-11 12:18:38] [INFO] connecting to master node of cluster 'gitlab_cluster'
2018-08-11_12:18:38.14077 [2018-08-11 12:18:38] [INFO] retrieving node list for cluster 'gitlab_cluster'
2018-08-11_12:18:38.14162 [2018-08-11 12:18:38] [DEBUG] connecting to: 'host=postgres-03-db-gprd.c.gitlab-production.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr sslmode=prefer sslcompression=0
fallback_application_name='repmgr''
2018-08-11_12:18:38.16285 [2018-08-11 12:18:38] [DEBUG] get_master_connection(): current master node is 842347107
2018-08-11_12:18:38.16286 [2018-08-11 12:18:38] [INFO] checking cluster configuration with schema 'repmgr_gitlab_cluster'
2018-08-11_12:18:38.16384 [2018-08-11 12:18:38] [INFO] checking node 859202352 in cluster 'gitlab_cluster'
2018-08-11_12:18:38.16425 [2018-08-11 12:18:38] [INFO] reloading configuration file
2018-08-11_12:18:38.16431 [2018-08-11 12:18:38] [INFO] configuration has not changed
2018-08-11_12:18:38.16431 [2018-08-11 12:18:38] [INFO] starting continuous standby node monitoring
2018-08-11_12:18:38.41688 [2018-08-11 12:18:38] [DEBUG] connecting to: 'host=postgres-02.db.prd.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr fallback_application_name='repmgr''
2018-08-11_12:18:38.43598 [2018-08-11 12:18:38] [ERROR] connection to database failed: could not connect to server: Connection refused
2018-08-11_12:18:38.43599 Is the server running on host "postgres-02.db.prd.gitlab.com" (10.66.1.102) and accepting
2018-08-11_12:18:38.43599 TCP/IP connections on port 5432?
2018-08-11_12:18:38.43600
2018-08-11_12:18:38.43600 [2018-08-11 12:18:38] [ERROR] unable to connect to upstream node: could not connect to server: Connection refused
2018-08-11_12:18:38.43600 Is the server running on host "postgres-02.db.prd.gitlab.com" (10.66.1.102) and accepting
2018-08-11_12:18:38.43600 TCP/IP connections on port 5432?
2018-08-11_12:18:38.43600
2018-08-11_12:18:38.43600 [2018-08-11 12:18:38] [DEBUG] connecting to: 'host=postgres-02.db.prd.gitlab.com port=5432 user=gitlab_repmgr dbname=gitlab_repmgr fallback_application_name='repmgr''
2018-08-11_12:18:38.45547 [2018-08-11 12:18:38] [ERROR] connection to database failed: could not connect to server: Connection refused
2018-08-11_12:18:38.45548 Is the server running on host "postgres-02.db.prd.gitlab.com" (10.66.1.102) and accepting
2018-08-11_12:18:38.45548 TCP/IP connections on port 5432?
2018-08-11_12:18:38.45548
2018-08-11_12:18:38.47398 [2018-08-11 12:18:38] [WARNING] connection to upstream has been lost, trying to recover... 30 seconds before failover decision
2018-08-11_12:18:43.49327 [2018-08-11 12:18:43] [WARNING] connection to upstream has been lost, trying to recover... 25 seconds before failover decision
2018-08-11_12:18:48.51261 [2018-08-11 12:18:48] [WARNING] connection to upstream has been lost, trying to recover... 20 seconds before failover decision
2018-08-11_12:18:53.53239 [2018-08-11 12:18:53] [WARNING] connection to upstream has been lost, trying to recover... 15 seconds before failover decision
2018-08-11_12:18:58.55169 [2018-08-11 12:18:58] [WARNING] connection to upstream has been lost, trying to recover... 10 seconds before failover decision
2018-08-11_12:19:03.57152 [2018-08-11 12:19:03] [WARNING] connection to upstream has been lost, trying to recover... 5 seconds before failover decision
2018-08-11_12:19:08.59105 [2018-08-11 12:19:08] [ERROR] unable to reconnect to upstream (timeout 60 seconds)...
2018-08-11_12:19:08.59160 [2018-08-11 12:19:08] [DEBUG] failure detected on master node (842491233); attempting to promote a standby
2018-08-11_12:19:08.59211 [2018-08-11 12:19:08] [DEBUG] 5 active nodes registered
2018-08-11_12:19:08.59212 [2018-08-11 12:19:08] [DEBUG] node=842347107 conninfo="host=postgres-03-db-gprd.c.gitlab-production.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr sslmode=prefer sslcomp
ression=0" type=master
2018-08-11_12:19:08.59212 [2018-08-11 12:19:08] [DEBUG] node=859202352 conninfo="host=postgres-02-db-gprd.c.gitlab-production.internal port=5432 user=gitlab_repmgr dbname=gitlab_repmgr" type=standby
@