2022-08-13: GSTG Scale down the number of Patroni CI replicas

Production Change

Change Summary

Details in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7524 .

Since we finished CI decomposition, we are running more replicas than needed on Patroni Main and Patroni CI. We will remove them one at a time, allowing for fast rollback by marking each node as noloadbalance and nofailover for some time before actually shutting it down.

This process is based on https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-down-patroni.md, adapted to run one node at a time with gaps between nodes.

Change Details

  1. Services Impacted - Service::PatroniCI Service::Postgres
  2. Change Technician - @DylanGriffith @rhenchen.gitlab
  3. Change Reviewer - @gsgl
  4. Time tracking - unknown
  5. Downtime Component - none

Detailed steps for the change

Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 3 days

  1. Set label ~change::in-progress: /label ~change::in-progress
  2. Choose a node you wish to remove and note it below:
    1. NODE: patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal
    2. The node must not be the primary (check with gitlab-patronictl list)
    3. The node must be a normal replica, i.e. it must not already have the noloadbalance or nofailover tags
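    For example, to check roles and tags across the cluster (a sketch; run it on any node in the patroni-ci-2004 cluster, and the exact output columns vary by Patroni version):
    # The chosen node should show the Replica role, a running state, and no tags.
    sudo gitlab-patronictl list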
  3. Pull up the Host Stats Grafana dashboard and select the node you are planning to remove. This will help you monitor the host.
  4. Disable chef on the host:
    sudo chef-client-disable "Removing patroni node: Ref issue https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7528"
  5. Add a tags section to /var/opt/gitlab/patroni/patroni.yml on the node:
    tags:
      nofailover: true
      noloadbalance: true
  6. Reload Patroni config:
    sudo systemctl reload patroni
  7. Verify the reload worked by checking the list of replicas returned by Consul; if the node name is absent, the reload took effect:
    dig @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV
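    Optionally, also confirm Patroni itself picked up the tags via its REST API. This is a sketch: port 8008 is the Patroni default and is an assumption, as is the exact shape of the JSON in this environment's Patroni version.
    # Assumption: the Patroni REST API listens on the default port 8008 on this node.
    curl -s http://127.0.0.1:8008/patroni | grep -o '"tags"[^}]*}'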
  8. Wait until all client connections are drained from the replica (this depends on the interval value set for the clients). Use the command below to track the number of client connections; it can take a few minutes until all connections are gone. If there are still a few connections on the pgbouncers after 5 minutes, check whether there are actually any active connections in the DB (should be 0 most of the time):
    for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l
    gitlab-psql -qtc "SELECT count(*) FROM pg_stat_activity
    WHERE pid <> pg_backend_pid()
    AND datname = 'gitlabhq_production'
    AND state <> 'idle'
    AND usename <> 'gitlab-monitor'
    AND usename <> 'postgres_exporter';"
    NOTE: There is a known issue with the Rails processes not refreshing their load balancer DNS cache, which may account for delays in draining connections. If this still isn't fixed, you may need to wait until the next deployment, when all Rails processes are restarted, to see all connections drained from the replica. The Rails processes should technically be resilient to replicas going down, but waiting until connections drain is the safest option.
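    Optionally, poll the pgbouncer client count from this step on an interval instead of re-running it by hand (a sketch reusing the same pgb-console wrappers shown above):
    # Re-checks the client count every 30 seconds; stop with Ctrl-C once it reaches 0.
    watch -n 30 'for c in /usr/local/bin/pgb-console*; do $c -c "SHOW CLIENTS;" | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l'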
  9. Confirm Patroni Metrics still looks healthy, especially for replica nodes
  10. Wait 3 working days to see a good amount of peak usage
  11. Confirm Patroni Metrics still looks healthy, especially for replica nodes
  12. Confirm CPU saturation across replicas remains below 70% at peak. Note: ignore the primary and backup nodes as they aren't relevant
  13. Silence alerts for Patroni going down on this node
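    One way to create the silence from a shell, assuming amtool is installed and pointed at the right Alertmanager. The Alertmanager URL, the fqdn matcher label, and the duration below are assumptions/placeholders; adjust them to match the labels the Patroni-down alert actually carries and the planned observation window.
    # Assumptions: the Alertmanager URL and the fqdn matcher are placeholders; pick a duration that covers the wait period.
    amtool silence add --alertmanager.url=https://alerts.gitlab.net \
      --comment="Scaling down patroni-ci replicas: https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7524" \
      --duration=72h \
      fqdn=patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal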
  14. Create a merge request to remove this replica node from patroni-ci-2004. MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4223
  15. Check out the branch for this merge request locally
  16. Run a plan to make sure the changes look right
    tf plan
  17. Merge this merge request to decrease the count for patroni-ci-2004
  18. Check out master locally and git pull the merged change
  19. Run a plan again to make sure the changes look right
    tf plan
  20. Apply the changes
    tf apply
  21. Wait until the node is removed and torn down (validate it in GCP):
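    For example, to confirm the instance is gone (a sketch; the project name is inferred from the node's internal hostname):
    # Assumption: the GSTG Patroni CI nodes live in the gitlab-staging-1 GCP project.
    gcloud compute instances list --project=gitlab-staging-1 --filter="name~'patroni-ci-2004-06'"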
  22. Set label ~change::complete: /label ~change::complete

Rollback

Rollback steps - steps to be taken in the event of a need to roll back this change

To re-add a node to the replica pool if it's not shut down yet

Estimated Time to Complete (mins) - 10

  • Re-enable chef on the node
    sudo chef-client-enable
  • Re-run chef
    sudo chef-client
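  • Optionally, confirm the node is serving replica traffic again using the same Consul check as earlier in this issue (a sketch):
    # The node name should reappear in the SRV answer once chef has restored the Patroni config.
    dig @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV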
  • Set label ~change::aborted: /label ~change::aborted

To re-create a node in the replica pool if it's already been deleted

  1. Follow the runbook https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-up-patroni.md

Monitoring

Key metrics to observe

Rollback Thresholds

* Network and Storage I/O performance limits in GSTG are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine type limits for pd-performance, not the block device limits

Change Reviewer checklist

C4 C3 C2 C1:

  • Check if the following applies:
    • The scheduled day and time of execution of the change is appropriate.
    • The change plan is technically accurate.
    • The change plan includes estimated timing values based on previous testing.
    • The change plan includes a viable rollback plan.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.

C2 C1:

  • Check if the following applies:
    • The complexity of the plan is appropriate for the corresponding risk of the change. (i.e. the plan contains clear details).
    • The change plan includes success measures for all steps/milestones during the execution.
    • The change adequately minimizes risk within the environment/service.
    • The performance implications of executing the change are well-understood and documented.
    • The specified metrics/monitoring dashboards provide sufficient visibility for the change.
      • If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
    • The change has a primary and secondary SRE with knowledge of the details available during the change window.
    • The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary

Change Technician checklist

  • Check if all items below are complete:
    • The change plan is technically accurate.
    • This Change Issue is linked to the appropriate Issue and/or Epic
    • Change has been tested in staging and results noted in a comment on this issue.
    • A dry-run has been conducted and results noted in a comment on this issue.
    • For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
    • For C1 and C2 change issues, the SRE on-call has been informed prior to change being rolled out. (In #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
    • For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
    • For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
    • Release managers have been informed (If needed! Cases include DB change) prior to change being rolled out. (In #production channel, mention @release-managers and this issue and await their acknowledgment.)
    • There are currently no active incidents that are severity::1 or severity::2
    • If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.