Switch traffic off pgbouncer-03 and restart to fix pgbouncer_exporter
C3
Production Change - Criticality 3

Change Objective | Switch traffic off pgbouncer-03 and restart to fix pgbouncer_exporter to restore metrics |
---|---|
Change Type | Operation |
Services Impacted | pgbouncer |
Change Team Members | @cmiskell |
Change Severity | C3 |
Change Reviewer or tested in staging | Review, and failover process tested on staging 2019-11-07 00:19 UTC |
Due Date | 2019-11-07 01:15 UTC (engineer in UTC+1300) |
Time tracking | ~30 minutes |
- On pgbouncer-03-db-gprd, stop the pgbouncer-leader-check service. This releases the consul lock on that node and stops the healthcheck, allowing the lock to be obtained on pgbouncer-01 (previously in standby) and the healthcheck to start up there, so that new connections to the ILB will be serviced by -01 and -02: `sudo systemctl stop pgbouncer-leader-check`
- Wait for connections to pgbouncer-03-db-gprd on port 6432 to go away, as processes cycle their DB connections and reconnect via the LB to the other active pgbouncers. It is not known with certainty how long this will take, but expect something on the order of 30 minutes.
  - Watch connections with `sudo watch "netstat -anlp|grep 6432|wc -l"` and https://dashboards.gitlab.net/d/pgbouncer-main/pgbouncer-overview?orgId=1 (the numbers on this dashboard may be a bit suspect per the titles, but are sufficiently indicative for this purpose).
- Once all connections to pgbouncer-03-db-gprd are gone, or there is no active pgbouncer traffic on port 6432 (any remaining connections are inactive and irrelevant), restart pgbouncer: `sudo systemctl restart pgbouncer`
  - If this phase takes too long, consider looking at which machines still have connections and performing a manual cycle on them (drain/quiesce, restart/HUP, return to service). The actual effort here will depend on the type of client node.
- Start the pgbouncer-leader-check service on pgbouncer-03-db-gprd again. This could be done earlier, and in the event of any issues on pgbouncer-01-db-gprd or pgbouncer-02-db-gprd it should be done immediately, but delaying until the end prevents a small hiccup from causing a failback to 03, which would require us to restart the entire process.
- Check the pgbouncer_exporter logs on pgbouncer-03-db-gprd (/var/log/prometheus/pgbouncer_exporter/current) to verify that the repeated restarts/stacktraces have gone away.
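The wait for connections to drain could also be scripted rather than watched by eye. The following is only a minimal sketch, assuming the Linux `netstat -ant` column layout (local address in column 4, state in column 6); `count_established` and `wait_for_drain` are hypothetical helper names, not part of the change plan. Unlike the `grep 6432` in the watch command, it counts only ESTABLISHED connections whose local port is 6432, so the listening socket and pgbouncer's outbound connections are excluded.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the polling interval are assumptions.

PGBOUNCER_PORT="${PGBOUNCER_PORT:-6432}"

# Read `netstat -ant` style output on stdin and print the number of
# ESTABLISHED connections whose local address ends in :$PGBOUNCER_PORT.
count_established() {
  awk -v port="$PGBOUNCER_PORT" '
    $6 == "ESTABLISHED" && $4 ~ (":" port "$") { n++ }
    END { print n + 0 }
  '
}

# Poll until no established client connections remain on the pgbouncer port.
# First argument: seconds between polls (defaults to 30).
wait_for_drain() {
  local interval="${1:-30}"
  local remaining
  while true; do
    remaining="$(netstat -ant | count_established)"
    echo "$(date -u +%H:%M:%SZ) established connections on :${PGBOUNCER_PORT}: ${remaining}"
    [ "$remaining" -eq 0 ] && break
    sleep "$interval"
  done
}
```

After sourcing the file on pgbouncer-03-db-gprd, `wait_for_drain 30` would print a count every 30 seconds and return once it reaches zero. Idle connections still count as established, so the "no active traffic" judgement from the plan remains a manual call.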
Edited by Craig Miskell