Ensure confidence that we can run gitlab-ctl reconfigure on our database server (take 2)
Discovery
As discussed in this issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4817
We've identified that passwords, at some point, were configured incorrectly. Now that our vaults are correct, we want confidence that executing a gitlab-ctl reconfigure on our pgbouncer and postgres boxes will not induce an outage.
We've successfully tested bits of the below on gstg.
Half of this work has already been completed in gprd from issue #511 (closed). This issue is here to pick up where we left off.
Plan of Action
Proposal
- Execute a test run of the reconfigure on pgbouncer-03 as such: /opt/gitlab/embedded/bin/chef-client -z -c /opt/gitlab/embedded/cookbooks/solo.rb -j /opt/gitlab/embedded/cookbooks/dna.json --why-run
  - The end result should indicate that there are no changes related to pgbouncer, its configuration, or authentication.
- Execute gitlab-ctl reconfigure on pgbouncer-03
- Validate we can talk through pgbouncer-03 to postgres - those steps are highlighted below
- If there are no changes to pgbouncer authentication and we are successfully passing our validation steps, proceed with the below; otherwise, we need to roll the configuration to prevent service interruption. For this, proceed with the Mitigation Plan highlighted below.
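The why-run output can be long, so a coarse filter may help when looking for pgbouncer-related changes. The following is only a sketch and not part of the plan itself: pgbouncer_changes is a hypothetical helper name, and it assumes that why-run describes each proposed change on a line containing "Would". It is not a substitute for reading the full output.

```shell
# Hypothetical helper (not an omnibus or chef command).
# Reads chef why-run output on stdin and prints every proposed change
# ("Would ...") whose line also mentions pgbouncer.
# Exit status: 0 if no such line was found, 1 if at least one was printed.
#
# Possible use on pgbouncer-03:
#   sudo /opt/gitlab/embedded/bin/chef-client -z \
#     -c /opt/gitlab/embedded/cookbooks/solo.rb \
#     -j /opt/gitlab/embedded/cookbooks/dna.json --why-run 2>&1 | pgbouncer_changes
pgbouncer_changes() {
  if grep -i 'would' | grep -i 'pgbouncer'; then
    return 1
  fi
  return 0
}
```

A non-zero exit status here only means "look closer"; a change to authentication could still be described on a line that this filter misses.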
Continue
- Execute a gitlab-ctl reconfigure on the rest of the pgbouncer nodes
- Ensure the site is continuing to work, validate sidekiq is still processing jobs
Done
Mitigation Plan
- Modify our role gprd-base-be-sidekiq to point default_attributes:omnibus-gitlab:gitlab_rb:gitlab-rails:db_host to pgbouncer-03
- Execute a chef run on ONE of the sidekiq nodes to validate we are error free - knife ssh 'sidekiq-pullmirror-01-sv-gprd.c.gitlab-production.internal' 'sudo chef-client && sudo gitlab-ctl hup unicorn'
- Run chef on gprd-base-be-sidekiq
- Proceed with running a gitlab-ctl hup unicorn on all sidekiq servers - roles:gprd-base-be-sidekiq
- Validate pgbouncer-02 is no longer seeing any traffic:
  - ssh into pgbouncer-02 and execute sudo gitlab-ctl pgb-console
  - a show pools; should indicate 0 cl_active on the gitlabhq_production db for the gitlab user
- Proceed with running a gitlab-ctl reconfigure on pgbouncer-02
- Validate pgbouncer-02 is good to go - steps highlighted below
- Modify our role gprd-base to point default_attributes:omnibus-gitlab:gitlab_rb:gitlab-rails:db_host to pgbouncer-02
- AFTER chef has run in our entire environment, perform a gitlab-ctl hup unicorn on all web/api/git servers - roles:gprd-base-fe
- Validate pgbouncer-01 is no longer seeing any traffic:
  - ssh into pgbouncer-01 and execute sudo gitlab-ctl pgb-console
  - a show pools; should indicate 0 cl_active on the gitlabhq_production db for the gitlab user
- Proceed with running a gitlab-ctl reconfigure on pgbouncer-01
- Validate pgbouncer-01 is good to go - steps highlighted below
Done
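The "0 cl_active" checks in the Mitigation Plan can be done by eye, but a small parser may make them less error-prone. This is a sketch under assumptions: active_clients is a hypothetical helper, and it expects the show pools; output in psql's default pipe-separated layout with a header row. How the output is captured from the console is left to the operator.

```shell
# Hypothetical helper (not a gitlab-ctl or pgbouncer command).
# Reads "show pools;" output on stdin and prints the summed cl_active
# for one database/user pair.
# Usage: active_clients <database> <user> < saved_show_pools_output
active_clients() {
  awk -F'|' -v db="$1" -v usr="$2" '
    function trim(s) { gsub(/^[ \t]+|[ \t]+$/, "", s); return s }
    # Until the header row is seen, look for the column positions.
    col == 0 {
      for (i = 1; i <= NF; i++) {
        name = trim($i)
        if (name == "database")  dbcol = i
        if (name == "user")      ucol = i
        if (name == "cl_active") col = i
      }
      next
    }
    # Data rows: add up cl_active for the requested database and user.
    trim($dbcol) == db && trim($ucol) == usr { total += $col }
    END { print total + 0 }
  '
}
```

With output saved from pgbouncer-02, active_clients gitlabhq_production gitlab should print 0 before the reconfigure on that node proceeds.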
Validation Steps
- On pgbouncer-0X, run psql commands as the gitlab user trying to connect to the gitlab production database: /opt/gitlab/embedded/bin/psql -h localhost -p 6432 -d gitlabhq_production -U gitlab
- Validate the databases.ini file has the primary database in the configuration: /var/opt/gitlab/consul/databases.ini
- Validate that we can successfully connect to the pgbouncer console: gitlab-ctl pgb-console
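The databases.ini check could be partly scripted. The sketch below is an assumption about the file layout (one "name = host=... ..." line per database under a [databases] section); has_database_entry is a hypothetical helper. It only confirms that an entry with a host= setting exists. Confirming that the host is the current primary still has to be done by hand.

```shell
# Hypothetical helper (not part of omnibus-gitlab).
# Succeeds if the ini file given as $1 has an entry for database $2
# that includes a host= setting.
# Possible use:
#   has_database_entry /var/opt/gitlab/consul/databases.ini gitlabhq_production
has_database_entry() {
  grep -Eq "^[[:space:]]*$2[[:space:]]*=.*host=" "$1"
}
```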
After Completion
- Validate no errors in /var/log/gitlab/pgbouncer/current
- Validate no errors in /var/log/gitlab/consul/failover_pgbouncer.log
- Monitor Postgresql Overview - Specifically the PGbouncer section
- Monitor Sentry for DB related issues
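For the two log checks, a quick count of error lines may be a useful first pass. This is a sketch only: count_log_errors is a hypothetical helper, and matching on the tokens ERROR and FATAL is an assumption about how these logs mark problems, so a count of 0 does not replace reading the logs.

```shell
# Hypothetical helper (not part of omnibus-gitlab).
# Prints the number of lines in the log file $1 containing ERROR or FATAL.
# Returns 2 if the file cannot be read.
# Possible use:
#   count_log_errors /var/log/gitlab/pgbouncer/current
#   count_log_errors /var/log/gitlab/consul/failover_pgbouncer.log
count_log_errors() {
  [ -r "$1" ] || { echo "cannot read $1" >&2; return 2; }
  # grep -c exits non-zero when the count is 0; that is not a failure here.
  grep -Ec 'ERROR|FATAL' "$1" || true
}
```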
Potential End Result
If the Mitigation Plan was followed:
- pgbouncer-02 will take traffic from our web/api/etc nodes
- pgbouncer-03 will take traffic from our sidekiq nodes
- pgbouncer-01 will seemingly not take any traffic
Rollback
- We'll utilize the Mitigation Plan
Edited by John Skarbek