Use service discovery for DB load balancing

Production Change - Criticality 3 C3

Services Impacted

  • GitLab Rails
  • Sidekiq
  • Postgres

Change Team Members

Change Severity

C4

Buddy check or tested in staging

Tested on staging

Schedule of the change

March 21st, 2019 - 1 AM UTC

Duration of the change

30-60 minutes

Detailed steps for the change

The change is encapsulated in an Ansible playbook (ansible-migrations!1), which is detailed in the collapsed section at the bottom.

Relevant chef changes:

Command execution would be as follows:

workstation $ ssh -A bastion-01-inf-gprd.c.gitlab-production.internal
bastion     $ git clone git@gitlab.com:gitlab-com/gl-infra/ansible-migrations.git
bastion     $ cd ansible-migrations
bastion     $ consul catalog nodes | grep -E '^(api|git|web|sidekiq)' | awk '{ print $3 }' >> production-633/inventory.txt
bastion     $ export OPS_API_TOKEN=<token>
bastion     $ export MIGRATION_ENV=gprd
bastion     $ export ANSIBLE_HOST_KEY_CHECKING=False
bastion     $ ~/.local/bin/ansible-playbook -i production-633/inventory.txt -M ./modules/ -e @production-633/variables.yml production-633/playbook.yml
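
The inventory line above builds the Ansible host list by filtering the Consul node catalog. As a rough sketch of what that filter keeps, the snippet below runs the same grep and awk pipeline against canned input instead of a live `consul catalog nodes` call. The node names, IDs, addresses and datacenter are invented for illustration, and the helper names `sample_nodes` and `filter_inventory` are hypothetical.

```shell
# Hypothetical sample of `consul catalog nodes` output; every value below is
# invented for illustration and does not describe the real fleet.
sample_nodes() {
  cat <<'EOF'
Node                 ID        Address      DC
api-01-sv-gprd       0a1b2c3d  10.0.0.11    east-us-2
git-01-sv-gprd       1b2c3d4e  10.0.0.12    east-us-2
patroni-01-db-gprd   2c3d4e5f  10.0.0.13    east-us-2
sidekiq-01-sv-gprd   3d4e5f6a  10.0.0.14    east-us-2
web-01-sv-gprd       4e5f6a7b  10.0.0.15    east-us-2
EOF
}

# Same filter as the runbook command: keep rows whose node name starts with
# api, git, web or sidekiq, then print the third column (the address).
filter_inventory() {
  grep -E '^(api|git|web|sidekiq)' | awk '{ print $3 }'
}

sample_nodes | filter_inventory
```

With this sample the header row and the patroni node are dropped, and only the four addresses of the api, git, sidekiq and web nodes are printed. Because the pattern is a prefix match, any node whose name merely begins with one of those words is also included, and because the runbook uses `>>`, results are appended to whatever `production-633/inventory.txt` already contains.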

Steps execution

    • Pre-conditions: N/A
    • Step: Stop chef-client on the Patroni cluster
      • knife ssh roles:gprd-base-db-patroni 'sudo service chef-client stop'
    • Post-execution validation:
      • knife ssh roles:gprd-base-db-patroni 'ps aux | grep chef-client | grep -v grep' | wc -l
        • It should print "0"
    • Rollback:
      • knife ssh roles:gprd-base-db-patroni 'sudo service chef-client start'
    • Pre-conditions: N/A
    • Step: Switch on maintenance mode in the Patroni cluster
      • knife ssh roles:gprd-base-db-patroni 'gitlab-patronictl pause --wait'
    • Post-execution validation:
      • knife ssh roles:gprd-base-db-patroni 'gitlab-patronictl list 2>/dev/null | grep "Maintenance mode: on" | grep -v grep' | wc -l
        • It should print "6"
    • Rollback:
      • knife ssh roles:gprd-base-db-patroni 'gitlab-patronictl resume --wait'
    • Pre-conditions: N/A
    • Step: Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/474
    • Post-execution validation: N/A
    • Rollback: Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/474, merge and apply
    • Pre-conditions:
      • ssh patroni-06-db-gprd.c.gitlab-production.internal 'gitlab-patronictl list | grep $(hostname)' | grep Leader | wc -l
        • It should print "0"
    • Step: Run chef-client on a single node
      • ssh patroni-06-db-gprd.c.gitlab-production.internal 'sudo chef-client && sudo systemctl restart consul'
    • Post-execution validation:
      • ssh patroni-06-db-gprd.c.gitlab-production.internal 'dig @localhost -p 8600 replica.patroni.service.consul. | grep 10.220.16.106 | wc -l'
        • It should print "1"
    • Rollback: N/A
    • Pre-conditions: N/A
    • Step: Roll out the changes to the rest of the cluster, one by one
      • knife ssh -C1 'roles:gprd-base-db-patroni -name:patroni-06-db-gprd.c.gitlab-production.internal' 'sudo chef-client && sudo systemctl restart consul'
    • Post-execution validation:
      • ssh patroni-06-db-gprd.c.gitlab-production.internal 'dig @localhost -p 8600 +short replica.patroni.service.consul. | wc -l'
        • It should print "5"
      • ssh patroni-06-db-gprd.c.gitlab-production.internal 'dig @localhost -p 8600 +short master.patroni.service.consul. | wc -l'
        • It should print "1"
    • Rollback: N/A
    • Pre-conditions: N/A
    • Step: Merge and apply https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/473
    • Post-execution validation: N/A
    • Rollback: Revert https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/473, merge and apply
    • Pre-conditions: N/A
    • Step: Run chef-client on every node with the gprd-base role
      • knife ssh roles:gprd-base 'sudo chef-client'
    • Post-execution validation:
      • ssh web-01-sv-gprd.c.gitlab-production.internal "sudo gitlab-rails r 'puts Gitlab::Database::LoadBalancing.proxy.load_balancer.instance_eval { @host_list }.hosts.map(&:host).count'"
        • It should print "5"
    • Rollback: N/A
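
Most of the validations above follow one pattern: run a command, count its output lines, and compare the count with an expected number. A minimal sketch of a helper for that pattern follows. The name `expect_count` is hypothetical and not part of the playbook, and the helper counts lines locally rather than piping to `wc -l` on the remote side.

```shell
# expect_count EXPECTED COMMAND [ARGS...]
# Runs COMMAND, counts its output lines, and compares the count to EXPECTED.
# Hypothetical helper for illustration; it is not part of the playbook.
expect_count() {
  expected=$1
  shift
  # tr strips the padding some wc implementations add around the number.
  actual=$("$@" | wc -l | tr -d ' ')
  if [ "$actual" = "$expected" ]; then
    echo "OK: $actual line(s)"
  else
    echo "FAIL: expected $expected line(s), got $actual" >&2
    return 1
  fi
}

# Example, using a host and query taken from the steps above:
# expect_count 5 ssh patroni-06-db-gprd.c.gitlab-production.internal \
#   'dig @localhost -p 8600 +short replica.patroni.service.consul.'
```

The helper ignores the exit status of the wrapped command, so a failed connection that prints nothing is counted as zero lines. For the checks that expect "0", it is worth confirming separately that the command actually ran.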
Edited by Ahmad Sherif