Update to Postgres 9.6.8 in production
When 10.6 RC5 is done (cf. gitlab-org/omnibus-gitlab!2356 (merged)), we should prioritise updating the production database to avoid a repeat of https://gitlab.com/gitlab-com/infrastructure/issues/3850
Omnibus has shipped 9.6.8 since 10.6, so one strategy is to simply update the omnibus packages in production. Due to gitlab-org/omnibus-gitlab#3346 (closed) we cannot simply update Omnibus on a live system and restart the database later. Doing so triggers errors from the moment the updated package is in place until the restart happens. We could work around that by manually adding a symlink, but I would prefer to just do rolling restarts of the servers.
In addition, there's a known problem with the version of the HA scripts in the existing omnibus package. If pgbouncer is restarted on the pgbouncer node, as happens when gitlab-ctl reconfigure runs, it generates an empty databases.ini file, which causes an outage. So I want to update the omnibus package there before I do this failover work. It's not strictly necessary, as we believe HA itself will work in the existing version, but it's a precarious situation that has caused two outages already.
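One way to guard against the empty databases.ini failure is to check the file immediately after any reconfigure on the pgbouncer node. A minimal sketch follows; the helper name is hypothetical and is not part of omnibus or gitlab-ctl.

```shell
# Succeed only if the given file exists and is non-empty.
# Hypothetical helper for illustration; not part of omnibus or gitlab-ctl.
check_databases_ini() {
  if [ -s "$1" ]; then
    echo "ok: $1 is non-empty"
  else
    echo "EMPTY OR MISSING: $1" >&2
    return 1
  fi
}
```

It would be called with the path to the generated file, for example `check_databases_ini /var/opt/gitlab/pgbouncer/databases.ini`. That path is an assumption about the omnibus layout and should be confirmed on the node before relying on it.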
So my plan is:
- Stop the chef service on the pgbouncer node
- Update the pgbouncer role with the 10.6 version of omnibus
- Manually run chef-client on the pgbouncer node
Then
- Stop chef on database nodes
- Update the database role to have the 10.6 version of omnibus
- Shut down database on postgres-04, run chef-client to update omnibus version on postgres-04, start up postgres-04.
- Repeat for postgres-03, postgres-02.
- Shut down postgres-01, which should cause a failover to postgres-0? automatically
- Run chef-client on postgres-01 and make it a replica using
gitlab-ctl repmgr standby follow
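The database part of the plan can be sketched as a dry run that only prints the intended order of operations. This is an illustration under assumptions, not the runbook: the hostnames and the `chef-client` and `gitlab-ctl repmgr standby follow` commands come from the plan above, while `gitlab-ctl stop postgresql` and `gitlab-ctl start postgresql` are assumed to be the way the database is shut down and started. The function names are hypothetical.

```shell
# Dry-run sketch: prints "host: command" lines and executes nothing.
REPLICAS="postgres-04 postgres-03 postgres-02"
PRIMARY="postgres-01"

# Print one planned step for a host.
step() {
  printf '%s: %s\n' "$1" "$2"
}

# Planned steps for upgrading one replica in place.
# The stop/start commands are assumptions, not confirmed by the plan.
upgrade_replica() {
  step "$1" "gitlab-ctl stop postgresql"
  step "$1" "chef-client"
  step "$1" "gitlab-ctl start postgresql"
}

rolling_plan() {
  for host in $REPLICAS; do
    upgrade_replica "$host"
  done
  # Stopping the primary is expected to trigger an automatic failover.
  step "$PRIMARY" "gitlab-ctl stop postgresql"
  step "$PRIMARY" "chef-client"
  # The new primary is unknown until the failover happens, so no
  # target host is filled in here.
  step "$PRIMARY" "gitlab-ctl repmgr standby follow"
}

rolling_plan
```

Printing the plan rather than running it keeps the order reviewable before anyone touches production, and the same list can be pasted into the runbook once the exact commands have been practised.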
Adding a runbook documenting this process: runbooks!544 (comment 67811012)
Dangers and questions:
- Need to do knowledge-transfer and practice exact commands needed to stop, change the chef configuration, and run chef
- The first chef run on postgres-04 may uncover problems updating omnibus in production
- In particular, are there any configuration changes made in production that may be overwritten?
- The load-balancing when replicas shut down may not work properly (cf. https://gitlab.com/gitlab-com/infrastructure/issues/3964)
- Shutting down postgres-01 may not cleanly fail over to another primary (cf. https://gitlab.com/gitlab-com/infrastructure/issues/3512)
As a side note, we'll want to update to 10.7 soon so we get the jsonlog module (gitlab-org/omnibus-gitlab!2332 (closed)) and the associated omnibus parameters. I think we should go ahead and do this upgrade this week anyway to get more practice, even if we have to do it again in a few weeks.