2022-08-13: GSTG Scale down the number of Patroni CI replicas
Production Change
Change Summary
Details in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7524.
Since we finished CI decomposition, we have been running too many replicas on Patroni Main and Patroni CI. We will remove them one at a time, allowing for fast rollback by marking nodes as `noloadbalance` and `nofailover` for some time before actually shutting them down.
This process will be based on https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-down-patroni.md, but adapted to run one node at a time with gaps between nodes.
Change Details
- Services Impacted - ~"Service::PatroniCI" ~"Service::Postgres"
- Change Technician - @DylanGriffith @rhenchen.gitlab
- Change Reviewer - @gsgl
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 3 days
- Set label ~"change::in-progress": `/label ~change::in-progress`
- Choose a node you wish to remove and note it below:
  - NODE: patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal
- The node must not be a primary; check with `gitlab-patronictl list`
- The node must be a normal replica, so it must not have `noloadbalance` nor `nofailover` (see the check sketched below)
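A quick way to confirm both conditions from the node is sketched here (the member-name layout of the `patronictl` table and the grep context size are assumptions):

```shell
# Hypothetical pre-flight check for the node chosen above.
NODE="patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal"

# The member table should list the node as a plain replica, not the leader.
sudo gitlab-patronictl list | grep "$NODE"

# And no drain tags should already be present in its Patroni config:
sudo grep -A3 '^tags:' /var/opt/gitlab/patroni/patroni.yml || echo "no tags section set"
```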
- Pull up the Host Stats Grafana dashboard and select the node you are planning to remove. This will help you monitor the host.
- Disable chef on the host: `sudo chef-client-disable "Removing patroni node: Ref issue https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7528"`
- Add a tags section to `/var/opt/gitlab/patroni/patroni.yml` on the node:

  ```yaml
  tags:
    nofailover: true
    noloadbalance: true
  ```
- Reload the Patroni config: `sudo systemctl reload patroni`
- Test the efficacy of that reload by checking for the node name in the list of replicas; if the name is absent, the reload worked (a filtered variant is sketched below):

  ```shell
  dig @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV
  ```
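For a quicker pass/fail check, the SRV answer can be filtered for the node name (a sketch; `+short` prints one `priority weight port target` line per serving replica):

```shell
# If the node has stopped serving reads, its name should no longer appear here.
dig +short @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV \
  | grep patroni-ci-2004-06 \
  || echo "node is out of the replica DNS pool"
```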
- Wait until all client connections are drained from the replica (this depends on the interval value set for the clients). Use the first command below to track the number of client connections; it can take a few minutes until all connections are gone. If there are still a few connections on the pgbouncers after ~5 minutes, use the second command to check whether there are actually any active connections in the DB (it should be 0 most of the time):

  ```shell
  for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l

  gitlab-psql -qtc "SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND datname = 'gitlabhq_production' AND state <> 'idle' AND usename <> 'gitlab-monitor' AND usename <> 'postgres_exporter';"
  ```
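To avoid re-running the count by hand, the pgbouncer check can be polled until it reaches zero (a sketch; the 30-second interval is an arbitrary choice):

```shell
# Poll the client count every 30s until the replica is fully drained.
while true; do
  count=$(for c in /usr/local/bin/pgb-console*; do
    $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor
  done | wc -l)
  echo "$(date -u +%H:%M:%S) client connections: $count"
  [ "$count" -eq 0 ] && break
  sleep 30
done
```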
- Confirm Patroni metrics still look healthy, especially for replica nodes
- Wait 3 working days to see a good amount of peak usage
- Confirm Patroni metrics still look healthy, especially for replica nodes
- Confirm CPU saturation across replicas remains below 70% at peak. Note: you should ignore the primary and backup nodes, as they aren't relevant
- Silence alerts for Patroni going down on this node (a command-line option is sketched below)
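The silence can be created through the Alertmanager UI; an amtool sketch follows, where the Alertmanager URL and the `fqdn` matcher label are assumptions about the local setup:

```shell
# Assumed Alertmanager URL and matcher label; adjust to the real setup.
amtool silence add \
  --alertmanager.url="https://alerts.gitlab.net" \
  --author="$(whoami)" \
  --comment="Scaling down patroni-ci, production issue 7524" \
  --duration="24h" \
  fqdn="patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal"
```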
- Create a merge request to remove this replica node from patroni-ci-2004. MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4223
- Check out the branch for this merge request locally
- Run a plan to make sure the changes look right: `tf plan` (a sanity check on the plan summary is sketched below)
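Before merging, it's worth confirming the plan summary matches expectations (a sketch, assuming `tf` is the repo's terraform wrapper):

```shell
# The destroy count should correspond to the one node's resources only;
# anything unexpected here is a reason to stop before merging.
tf plan | grep -E '^Plan:'
# e.g. "Plan: 0 to add, 0 to change, 1 to destroy."
# (exact counts depend on how many resources make up a node)
```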
- Merge this merge request to decrease the count for patroni-ci-2004
- Check out `master` locally and `git pull` the merged change
- Run a plan again to make sure the changes look right: `tf plan`
- Apply the changes: `tf apply`
- Wait until the node gets removed and torn down. Validate it in GCP (one way is sketched below)
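A command-line validation sketch (the project ID comes from the node's FQDN; the filter pattern is an assumption):

```shell
# The removed node should no longer appear in the instance list.
gcloud compute instances list \
  --project=gitlab-staging-1 \
  --filter="name~'patroni-ci-2004'"
```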
- Set label ~"change::complete": `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
To re-add a node to the replica pool if it's not shut down yet
Estimated Time to Complete (mins) - 10
- Re-enable chef on the node: `sudo chef-client-enable`
- Re-run chef: `sudo chef-client`
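Assuming the chef run restores the original patroni.yml (dropping the tags) and Patroni reloads it, the node should reappear in the replica pool; the drain-step check can be reused to confirm:

```shell
# The node's name should show up again once it is serving reads.
dig +short @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV \
  | grep patroni-ci-2004-06 \
  && echo "node is back in the replica DNS pool"
```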
- Set label ~"change::aborted": `/label ~change::aborted`
To re-create a node in the replica pool if it's already been deleted
- Follow the runbook https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-up-patroni.md
Monitoring
Key metrics to observe
- Metric: Patroni CI
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: saturation increases, errors, or an SLI dip, particularly related to replicas
- Metric: Patroni Node - CPU Utilization
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=avg(instance%3Anode_cpu_utilization%3Aratio%7Benv%3D%22gstg%22%2Cenvironment%3D%22gstg%22%2Ctype%3D%22patroni-ci%22%7D)%20by%20(fqdn)&g0.tab=0&g0.stacked=0&g0.range_input=6h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: CPU saturation peaks exceeding 70% frequently. Note you should hide/ignore the primary and backup nodes as they aren't relevant.
Rollback Thresholds
- Metric: Replica nodes CPU load (processes per core)
- Location: `node_load1`
- What changes to this metric should prompt a rollback: CPU load avg > 0.7 (per core) for 15 minutes or more
- Metric: Replica nodes CPU usage (% of all CPUs)
- Location: `node_cpu_utilization`
- What changes to this metric should prompt a rollback: avg CPU utilization > 70% for 15 minutes or more
- Metric: Replica nodes memory thrashing (swap in/out)
- Location: `node_vmstat_pswpin`, `node_vmstat_pswpout`
- What changes to this metric should prompt a rollback: spikes of swapping activity > 0 for 5 minutes or more
- Metric: Replica nodes I/O wait
- Location: `node_disk_read_time_seconds_total`, `node_disk_write_time_seconds_total`
- What changes to this metric should prompt a rollback: avg I/O wait > 10ms (or 0.01s) for 2 minutes or more, but only if caused by intense I/O activity
- Metric: Replica nodes I/O throughput in MB/s
- Location: `node_disk_read_bytes_total`, `node_disk_written_bytes_total` for `/dev/sdb`
- What changes to this metric should prompt a rollback: I/O throughput > 560 MB/s (70% of the 800 MB/s limit*) for 15 minutes or more
- Metric: Replica nodes IOPS
- Location: `node_disk_reads_completed_total`, `node_disk_writes_completed_total` for `/dev/sdb`
- What changes to this metric should prompt a rollback: IOPS > 10,500 (70% of the 15,000 IOPS limit*) for 15 minutes or more
- Metric: Writer nodes network throughput
- Location: `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`
- What changes to this metric should prompt a rollback: sustained network throughput > 11.2 Gbps (1.4 GB/s), 70% of the 16 Gbps (2 GB/s) VM limit*, for 15 minutes or more
* Network and storage I/O performance limits in GSTG are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine type limit for pd-performance, not the block device limits.
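These thresholds can also be spot-checked against the Thanos HTTP API rather than the graph UI (a sketch for the CPU metric; the recording rule name is taken from the dashboard link above, and the jq formatting is just a convenience):

```shell
# Per-replica CPU utilization; any fqdn sustaining > 0.7 for 15 minutes
# is a rollback signal per the thresholds above.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=avg by (fqdn) (instance:node_cpu_utilization:ratio{env="gstg",type="patroni-ci"})' \
  | jq -r '.data.result[] | "\(.metric.fqdn) \(.value[1])"'
```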
Change Reviewer checklist
- Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary
Change Technician checklist
- Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the `eoc_approved` label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are `severity::1` or `severity::2`
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.