# Upgrade GitLab to 11.0.0 on all PostgreSQL and pgbouncer hosts
Currently the PostgreSQL and pgbouncer hosts are running GitLab 10.6.4 EE. GitLab 11.0 introduced support for `default_statistics_target` (gitlab-com/database#105), but to make use of this we need to upgrade the packages first.
If upgrading the package doesn't trigger a restart, I think we can do this in place, starting with all secondaries. However, based on past maintenance experience we may want to perform a failover instead. Using an in-place update, the procedure would roughly be:
- Stop chef-client on all hosts
- Update the pinned version in Chef
- Download and install the package
- Run a reconfigure to make sure the PostgreSQL configuration files have the necessary settings
- Re-enable chef-client on all hosts once everything is upgraded
With a failover the procedure is the same, except we first stop pgbouncer, and do all the work on the secondaries. Once the secondaries are done, we fail over to a secondary, then upgrade the old primary and turn it into a secondary.
How can we best approach this?
## Schedule
I propose we do this after 11.1 is released, for the following reasons:
- 11.1 includes Omnibus support for service discovery (gitlab-org/omnibus-gitlab!2610). Deploying 11.1 right away, instead of 11.0, means we don't have to do this procedure twice in a short period.
- Andreas is currently on holiday and won't be back until the 12th, so doing it this month is probably too soon.
- We need to test this a bunch of times in staging. This ensures we have a better understanding of how this procedure will affect a running system, what additional steps we may need to run, etc.
## Plan
### Before everything
- [ ] Reduce the repmgr failover timeout from 30 seconds to 5 seconds. This ensures a failover happens after 5 seconds of the primary not responding, instead of 30. Make sure this is applied using `chef-client` and `gitlab-ctl reconfigure` on all pgbouncer and postgres hosts.
- [ ] Disable chef-client on all database and pgbouncer hosts (https://gitlab.com/gitlab-com/runbooks/blob/master/howto/disable-chef-runs-on-a-vm.md):
  - [ ] postgres-01
  - [ ] postgres-02 (primary)
  - [ ] postgres-03
  - [ ] postgres-04
  - [ ] pgbouncer-01
  - [ ] pgbouncer-02
- [ ] Verify that there are no remaining chef-client processes running on these hosts:
  - [ ] postgres-01
  - [ ] postgres-02 (primary)
  - [ ] postgres-03
  - [ ] postgres-04
  - [ ] pgbouncer-01 (primary, web)
  - [ ] pgbouncer-02 (primary, sidekiq)
- [ ] Upgrade the pinned package version in Chef and push to master, but don't run `chef-client` or `reconfigure` yet.
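For the timeout change above: repmgr derives its failover window from its reconnect settings (roughly `reconnect_attempts` × `reconnect_interval`, which are real repmgr options). How Omnibus exposes these in `gitlab.rb` is an assumption to verify against the repmgr cookbook before applying; a ~5-second window in `repmgr.conf` could look like:

```ini
# repmgr.conf fragment (managed by Chef). The failover window is roughly
# reconnect_attempts * reconnect_interval; the exact values used for the
# current 30-second window are an assumption and should be checked first.
reconnect_attempts = 1   # one failed check before promoting
reconnect_interval = 5   # seconds between reconnection attempts
```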
### For every secondary
- [ ] postgres-01
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
- [ ] postgres-03
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
- [ ] postgres-04
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
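The per-secondary sequence can be sketched as one script. This is a dry-run sketch, not a tested production script: with `RUN="echo"` each command is printed rather than executed, and the guard on `runtime.conf` mirrors the verification step in the checklist.

```shell
# Dry-run sketch of the per-secondary upgrade. Set RUN="" to actually execute.
RUN="echo"
$RUN sudo gitlab-ctl stop pgbouncer      # always first: pgbouncer before PostgreSQL
$RUN sudo gitlab-ctl stop postgresql
$RUN sudo chef-client                    # installs the pinned 11.x package
$RUN sudo gitlab-ctl reconfigure
# Only restart once the new setting is actually in the rendered config.
conf=/var/opt/gitlab/postgresql/data/runtime.conf
if [ -f "$conf" ] && ! grep -q '^default_statistics_target = 1000' "$conf"; then
  echo "runtime.conf is missing default_statistics_target; stopping here" >&2
  exit 1
fi
$RUN sudo gitlab-ctl start postgresql
$RUN sudo gitlab-ctl start pgbouncer
```

Run the script on one host at a time and confirm it is replicating again before moving to the next, so at most one secondary is out of rotation.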
### For the primary
The primary is postgres-02. Instead of doing work in place, I propose we fail over, then turn the old primary into a secondary.
- [ ] Stop sidekiq-cluster on all `gitlab-base-be-sidekiq` nodes. This reduces the pressure on the primary, and means we have one less piece of software to worry about during the failover.
Then, on postgres-02 (the current primary):
- [ ] `sudo gitlab-ctl stop pgbouncer`
- [ ] `sudo gitlab-ctl stop postgresql`
After roughly 5 seconds, repmgr should fail over to a new host, and update pgbouncer-01 and pgbouncer-02 to point to this new host.
Once the new host has been confirmed to work, we need to take the following steps:
- [ ] Note down the internal IP of the new primary
- [ ] Note down the internal IP of the old primary
- [ ] Remove the old primary's internal IP from the `db_load_balancing` configuration of the roles `gitlab-base` and `canary-base`
- [ ] Turn the old primary into a proper secondary, and make sure it can run queries
- [ ] Add the internal IP of the old primary (now a secondary) back to `db_load_balancing` in the roles `gitlab-base` and `canary-base`
- [ ] Make sure the new secondary is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
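For the "turn the old primary into a proper secondary" step, the Omnibus repmgr tooling provides `gitlab-ctl repmgr standby setup`, which re-clones the node from the current primary and registers it as a standby; note that it wipes the local data directory, so it must only be run on the demoted node. A dry-run sketch, with a placeholder hostname:

```shell
# Dry-run sketch. Set RUN="" to actually execute (destructive on local data!).
RUN="echo"
NEW_PRIMARY="postgres-01.example.internal"   # placeholder; use the real node name
# Re-clone from the new primary and register this node as a standby.
$RUN sudo gitlab-ctl repmgr standby setup "$NEW_PRIMARY"
# Afterwards the cluster should show one master and three standbys.
$RUN sudo gitlab-ctl repmgr cluster show
```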
### After everything
- [ ] Make sure that WAL-E is using the right host for point-in-time recovery
- [ ] Make sure that any other backups we make (e.g. a base backup) use the new primary
- [ ] Stop the pgbouncer exporter @stanhu is running in a tmux session somewhere, since this will no longer be needed
- [ ] Restore the 30-second repmgr failover timeout, and make sure this is applied using `chef-client` and `gitlab-ctl reconfigure`