# Upgrade GitLab to 11.0.0 on all PostgreSQL and pgbouncer hosts
Currently the PostgreSQL and pgbouncer hosts are running GitLab 10.6.4 EE. GitLab 11.0 introduced support for `default_statistics_target` (gitlab-com/database#105), but to make use of this we need to upgrade the packages first.
If upgrading the package doesn't trigger a restart, I think we can do this in place, starting with all secondaries. However, based on past maintenance experience we may want to perform a failover instead. Using an in-place update, the procedure would roughly be:
- Stop chef-client on all hosts
- Update the pinned version in Chef
- Download and install the package
- Run a reconfigure to make sure the PostgreSQL configuration files have the necessary settings
- Re-enable chef-client on all hosts once everything is upgraded
With a failover the procedure is the same, except we first stop pgbouncer, and do all the work on the secondaries. Once the secondaries are done, we fail over to a secondary, then upgrade the old primary and turn it into a secondary.
How can we best approach this?
## Schedule
I propose we do this after 11.1 is released, for the following reasons:
- 11.1 includes Omnibus support for service discovery (gitlab-org/omnibus-gitlab!2610). Deploying 11.1 right away, instead of 11.0, means we don't have to do this procedure twice in a short period.
- Andreas is currently on holiday and won't be back until the 12th, so doing it this month is probably too soon.
- We need to test this a bunch of times in staging. This ensures we have a better understanding of how this procedure will affect a running system, what additional steps we may need to run, etc.
## Plan
### Before everything
- [ ] Reduce the repmgr failover timeout from 30 seconds to 5 seconds. This ensures a failover happens after 5 seconds of the primary not responding, instead of 30. Make sure this is applied using `chef-client` and `gitlab-ctl reconfigure` on all pgbouncer and postgres hosts.
- [ ] Disable chef-client on all database and pgbouncer hosts (https://gitlab.com/gitlab-com/runbooks/blob/master/howto/disable-chef-runs-on-a-vm.md):
  - [ ] postgres-01
  - [ ] postgres-02 (primary)
  - [ ] postgres-03
  - [ ] postgres-04
  - [ ] pgbouncer-01
  - [ ] pgbouncer-02
- [ ] Verify that there are no remaining chef-client processes running on these hosts:
  - [ ] postgres-01
  - [ ] postgres-02 (primary)
  - [ ] postgres-03
  - [ ] postgres-04
  - [ ] pgbouncer-01 (primary, web)
  - [ ] pgbouncer-02 (primary, sidekiq)
- [ ] Upgrade the pinned package version in Chef and push to master, but don't run `chef-client` or `reconfigure` yet.
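For the timeout change above: repmgr derives its failover window from its reconnect settings (roughly `reconnect_attempts` × `reconnect_interval`, which are real repmgr options). How Omnibus exposes these in `gitlab.rb` is an assumption to verify against the repmgr cookbook before applying; a ~5-second window in `repmgr.conf` could look like:

```ini
# repmgr.conf fragment (managed by Chef). The failover window is roughly
# reconnect_attempts * reconnect_interval; the exact values used for the
# current 30-second window are an assumption and should be checked first.
reconnect_attempts = 1   # one failed check before promoting
reconnect_interval = 5   # seconds between reconnection attempts
```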
### For every secondary
- [ ] postgres-01
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
- [ ] postgres-03
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
- [ ] postgres-04
  - [ ] `sudo gitlab-ctl stop pgbouncer` ⚠ This must be done before stopping PostgreSQL
  - [ ] `sudo gitlab-ctl stop postgresql`
  - [ ] `sudo chef-client`
  - [ ] `sudo gitlab-ctl reconfigure`
  - [ ] Verify that `/var/opt/gitlab/postgresql/data/runtime.conf` includes `default_statistics_target = 1000`
  - [ ] `sudo gitlab-ctl start postgresql`. This may take a few minutes; progress can be seen by running `sudo tail -f /var/log/gitlab/postgresql/current`
  - [ ] Verify that PostgreSQL is working by running `sudo gitlab-psql gitlabhq_production`, followed by the query `SELECT COUNT(*) FROM users;`. This should produce a number around 2.3 million.
  - [ ] `sudo gitlab-ctl start pgbouncer`
  - [ ] Make sure the host is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
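The per-secondary sequence can be sketched as one script. This is a dry-run sketch, not a tested production script: with `RUN="echo"` each command is printed rather than executed, and the guard on `runtime.conf` mirrors the verification step in the checklist.

```shell
# Dry-run sketch of the per-secondary upgrade. Set RUN="" to actually execute.
RUN="echo"
$RUN sudo gitlab-ctl stop pgbouncer      # always first: pgbouncer before PostgreSQL
$RUN sudo gitlab-ctl stop postgresql
$RUN sudo chef-client                    # installs the pinned 11.x package
$RUN sudo gitlab-ctl reconfigure
# Only restart once the new setting is actually in the rendered config.
conf=/var/opt/gitlab/postgresql/data/runtime.conf
if [ -f "$conf" ] && ! grep -q '^default_statistics_target = 1000' "$conf"; then
  echo "runtime.conf is missing default_statistics_target; stopping here" >&2
  exit 1
fi
$RUN sudo gitlab-ctl start postgresql
$RUN sudo gitlab-ctl start pgbouncer
```

Run the script on one host at a time and confirm it is replicating again before moving to the next, so at most one secondary is out of rotation.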
### For the primary
The primary is postgres-02. Instead of doing work in place, I propose we fail over, then turn the old primary into a secondary.
- [ ] Stop sidekiq-cluster on all `gitlab-base-be-sidekiq` nodes. This reduces the pressure on the primary, and means we have one less piece of software to worry about during the failover.
Then, on postgres-02 (the current primary):
- [ ] `sudo gitlab-ctl stop pgbouncer`
- [ ] `sudo gitlab-ctl stop postgresql`
After roughly 5 seconds, repmgr should fail over to a new host, and update pgbouncer-01 and pgbouncer-02 to point to this new host.
Once the new host has been confirmed to work, we need to take the following steps:
- [ ] Note down the internal IP of the new primary
- [ ] Note down the internal IP of the old primary
- [ ] Remove the old primary's internal IP from the `db_load_balancing` configuration of the roles `gitlab-base` and `canary-base`
- [ ] Turn the old primary into a proper secondary, and make sure it can run queries
- [ ] Add the internal IP of the old primary (now a secondary) back to `db_load_balancing` in the roles `gitlab-base` and `canary-base`
- [ ] Make sure the new secondary is receiving transactions by looking at https://performance.gitlab.net/dashboard/db/postgres-stats?panelId=5&fullscreen&orgId=1
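For the "turn the old primary into a proper secondary" step, the Omnibus repmgr tooling provides `gitlab-ctl repmgr standby setup`, which re-clones the node from the current primary and registers it as a standby; note that it wipes the local data directory, so it must only be run on the demoted node. A dry-run sketch, with a placeholder hostname:

```shell
# Dry-run sketch. Set RUN="" to actually execute (destructive on local data!).
RUN="echo"
NEW_PRIMARY="postgres-01.example.internal"   # placeholder; use the real node name
# Re-clone from the new primary and register this node as a standby.
$RUN sudo gitlab-ctl repmgr standby setup "$NEW_PRIMARY"
# Afterwards the cluster should show one master and three standbys.
$RUN sudo gitlab-ctl repmgr cluster show
```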
### After everything
- [ ] Make sure that WAL-E is using the right host for point-in-time recovery
- [ ] Make sure that any other backups we make (e.g. a base backup) use the new primary
- [ ] Stop the pgbouncer exporter @stanhu is running in a tmux session somewhere, since this will no longer be needed
- [ ] Restore the 30-second repmgr failover timeout, and make sure this is applied using `chef-client` and `gitlab-ctl reconfigure`