Ensure confidence that we can run chef and gitlab-ctl without issues
Discovery
As discussed in this issue: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/4817

We've identified that passwords, at some point, were configured incorrectly. Now that our vaults are correct, let's hope we don't induce an outage by executing a `gitlab-ctl reconfigure` on our pgbouncer and postgres boxes.
We've successfully tested bits of the below on gstg.
Plan of Action
Proposal
- Identify all postgres servers in the gprd environment:

  ```
  knife search -i 'roles:gprd-base-db-postgres'
  ```
- Identify the primary server by referencing the replication overview dashboard
  - primary: `postgres-03-db-gprd.c.gitlab-production.internal`
- Execute a test run of the reconfigure on all postgresql nodes as such:

  ```
  for h in $(knife search -i 'roles:gprd-base-db-postgres' 2>/dev/null); do
    knife ssh "name:$h" "sudo /opt/gitlab/embedded/bin/chef-client -z -c /opt/gitlab/embedded/cookbooks/solo.rb -j /opt/gitlab/embedded/cookbooks/dna.json --why-run" | tee "/var/tmp/chef-why-run-$h"
  done
  ```
- Examine the resulting log files in /var/tmp for changes (see the grep sketch after this list)
- Execute `gitlab-ctl reconfigure` on the postgresql primary node
- If nothing has changed, proceed to perform a `gitlab-ctl reconfigure` on the rest of the postgresql nodes
- If something did change related to pgbouncer, we are probably inducing an outage; go to the Panic Revert section below
- Execute a test run of the reconfigure on pgbouncer-03 as such:

  ```
  /opt/gitlab/embedded/bin/chef-client -z -c /opt/gitlab/embedded/cookbooks/solo.rb -j /opt/gitlab/embedded/cookbooks/dna.json --why-run
  ```

  - The end result should indicate that there are no changes related to pgbouncer, its configuration, or authentication.
- Execute `gitlab-ctl reconfigure` on pgbouncer-03
- Validate we can talk through pgbouncer-03 to postgres; those steps are highlighted below
- If there are no changes to pgbouncer authentication and we are successfully passing our validation steps, proceed with the below; otherwise, we need to roll the configuration to prevent an outage. For this, proceed with the Mitigation Plan highlighted below.
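To examine the why-run logs captured above, a minimal grep pass (a sketch: the filenames match the `tee` destination used in the loop, and chef-client prefixes actions it would have taken with "Would"):

```
# Surface pending actions and anything pgbouncer-related in the why-run logs.
grep -iE 'would|pgbouncer' /var/tmp/chef-why-run-*
```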
Continue
- Execute a `gitlab-ctl reconfigure` on the rest of the pgbouncer nodes
- Ensure the site is continuing to work; validate sidekiq is still processing jobs (a quick check follows this list)
- Done
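For the sidekiq check, one option (a sketch, assuming rails console access on a sidekiq or console node; the admin Background Jobs page or the sidekiq dashboards work equally well) is to confirm the processed counter is still climbing:

```
# Print the global processed-job counter twice, 10s apart; it should increase.
sudo gitlab-rails runner 'puts Sidekiq::Stats.new.processed; sleep 10; puts Sidekiq::Stats.new.processed'
```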
Panic Revert
- You are here because something unexpected happened to the pgbouncer user when executing a reconfigure on a postgresql node
- More than likely, the password is no longer correct, so reset it
- Grab a psql shell on a postgresql node (see the sketch after this list) and run:

  ```
  ALTER USER pgbouncer WITH PASSWORD '<password from 1Password>';
  ```
- Monitor Postgresql Overview, specifically the PGbouncer section
- Monitor Sentry for DB-related issues
- Stop this maintenance work immediately.
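A sketch for grabbing that psql shell via the omnibus wrapper (run on the current primary; `gitlab-psql` connects over the local socket as the superuser):

```
# On the postgresql primary:
sudo gitlab-psql -d gitlabhq_production
# then issue the ALTER USER statement above at the psql prompt
```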
Mitigation Plan
- Modify our role `gprd-base` to point `default_attributes:omnibus-gitlab:gitlab_rb:gitlab-rails:db_host` to pgbouncer-03 (a `knife role edit` sketch follows this list)
- AFTER chef has run in our entire environment, perform a `gitlab-ctl hup unicorn` on all web/api/git servers (`roles:gprd-base-fe`)
- Proceed with running a `gitlab-ctl reconfigure` on pgbouncer-01
- Validate pgbouncer-01 is good to go (steps highlighted below)
- Modify our role `gprd-base-be-sidekiq` to point `default_attributes:omnibus-gitlab:gitlab_rb:gitlab-rails:db_host` to pgbouncer-01
- Proceed with running a `gitlab-ctl hup unicorn` on all sidekiq servers (`roles:gprd-base-be-sidekiq`)
- Proceed with running a `gitlab-ctl reconfigure` on pgbouncer-02
- Validate pgbouncer-02 is good to go (steps highlighted below)
- Done
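For the role edits above, a sketch of the workflow (the attribute nesting shown in the comment is an assumption; confirm it against the current role JSON before saving):

```
# Opens the role JSON in $EDITOR; set the nested attribute, roughly:
#   "default_attributes": { "omnibus-gitlab": { "gitlab_rb": {
#     "gitlab-rails": { "db_host": "<pgbouncer-03 address>" } } } }
knife role edit gprd-base

# Same workflow for the sidekiq role, pointing db_host at pgbouncer-01:
knife role edit gprd-base-be-sidekiq
```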
Validation Steps
- On `pgbouncer-0X`, run psql commands as the `gitlab` user, trying to connect to the gitlab production database:

  ```
  /opt/gitlab/embedded/bin/psql -h localhost -p 6432 -d gitlabhq_production -U gitlab
  ```
- Validate that we can successfully connect to the pgbouncer console: `gitlab-ctl pgb-console` (example queries below)
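Once at the console, a couple of standard pgbouncer admin queries (a sketch) give a quick health read: `SHOW POOLS;` lists active/waiting clients and servers per pool, and `SHOW DATABASES;` lists the configured targets:

```
SHOW POOLS;
SHOW DATABASES;
```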
After Completion
- Validate no errors in `/var/log/gitlab/postgresql/current` (see the scan sketch after this list)
- Validate no errors in `/var/log/gitlab/pgbouncer/current`
- Validate no errors in `/var/log/gitlab/consul/failover_pgbouncer.log`
- Monitor Postgresql Overview, specifically the PGbouncer section
- Monitor Sentry for DB-related issues
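A minimal sketch for scanning all three logs in one pass on a given node (adjust the tail depth and pattern to taste):

```
for f in /var/log/gitlab/postgresql/current \
         /var/log/gitlab/pgbouncer/current \
         /var/log/gitlab/consul/failover_pgbouncer.log; do
  echo "== $f =="
  sudo tail -n 200 "$f" | grep -iE 'error|fatal' || echo "no errors in last 200 lines"
done
```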
Potential End Result
If the Mitigation Plan was followed:
- `pgbouncer-03` will take traffic from our web/api/etc nodes
- `pgbouncer-01` will take traffic from our sidekiq nodes
- `pgbouncer-02` will seemingly not take any traffic
Rollback
- We'll utilize the Mitigation Plan above
- No need to roll back; we are simply ensuring we can run a `gitlab-ctl reconfigure` on our infrastructure without fear.