2022-08-13: GSTG Scale down the number of Patroni CI replicas
Production Change
Change Summary
Details in https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7524.
Since we finished CI decomposition, we have been running too many replicas on Patroni Main and Patroni CI. We will remove them one at a time, allowing for fast rollback by marking nodes as `noloadbalance` and `nofailover` for some time before actually shutting them down.
This process will be based on https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-down-patroni.md, but adapted to run one node at a time with gaps between nodes.
Change Details
- Services Impacted - ~"Service::PatroniCI" ~"Service::Postgres"
- Change Technician - @DylanGriffith @rhenchen.gitlab
- Change Reviewer - @gsgl
- Time tracking - unknown
- Downtime Component - none
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 3 days
- Set label ~"change::in-progress": `/label ~change::in-progress`
- Choose a node you wish to remove and note it below:
  - NODE: patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal
- The node must not be a primary; check with `gitlab-patronictl list`
- The node must be a normal replica, so it must not have `noloadbalance` nor `nofailover` (see the check sketched below)
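A quick way to confirm both conditions from the node is sketched here (the member-name layout of the `patronictl` table and the grep context size are assumptions):

```shell
# Hypothetical pre-flight check for the node chosen above.
NODE="patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal"

# The member table should list the node as a plain replica, not the leader.
sudo gitlab-patronictl list | grep "$NODE"

# And no drain tags should already be present in its Patroni config:
sudo grep -A3 '^tags:' /var/opt/gitlab/patroni/patroni.yml || echo "no tags section set"
```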
- Pull up the Host Stats Grafana dashboard and select the node you are planning to remove. This will help you monitor the host.
- Disable chef on the host: `sudo chef-client-disable "Removing patroni node: Ref issue https://gitlab.com/gitlab-com/gl-infra/production/-/issues/7528"`
- Add a tags section to `/var/opt/gitlab/patroni/patroni.yml` on the node:

  ```yaml
  tags:
    nofailover: true
    noloadbalance: true
  ```
- Reload the Patroni config: `sudo systemctl reload patroni`
- Test the efficacy of that reload by checking for the node name in the list of replicas; if the name is absent, the reload worked (a filtered variant is sketched below):

  ```shell
  dig @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV
  ```
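For a quicker pass/fail check, the SRV answer can be filtered for the node name (a sketch; `+short` prints one `priority weight port target` line per serving replica):

```shell
# If the node has stopped serving reads, its name should no longer appear here.
dig +short @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV \
  | grep patroni-ci-2004-06 \
  || echo "node is out of the replica DNS pool"
```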
- Wait until all client connections are drained from the replica (this depends on the interval value set for the clients). Use the first command below to track the number of client connections; it can take a few minutes until all connections are gone. If there are still a few connections on the pgbouncers after ~5 minutes, use the second command to check whether there are actually any active connections in the DB (it should be 0 most of the time):

  ```shell
  for c in /usr/local/bin/pgb-console*; do $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor; done | wc -l

  gitlab-psql -qtc "SELECT count(*) FROM pg_stat_activity WHERE pid <> pg_backend_pid() AND datname = 'gitlabhq_production' AND state <> 'idle' AND usename <> 'gitlab-monitor' AND usename <> 'postgres_exporter';"
  ```
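To avoid re-running the count by hand, the pgbouncer check can be polled until it reaches zero (a sketch; the 30-second interval is an arbitrary choice):

```shell
# Poll the client count every 30s until the replica is fully drained.
while true; do
  count=$(for c in /usr/local/bin/pgb-console*; do
    $c -c 'SHOW CLIENTS;' | grep gitlabhq_production | grep -v gitlab-monitor
  done | wc -l)
  echo "$(date -u +%H:%M:%S) client connections: $count"
  [ "$count" -eq 0 ] && break
  sleep 30
done
```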
- Confirm Patroni metrics still look healthy, especially for replica nodes
- Wait 3 working days to see a good amount of peak usage
- Confirm Patroni metrics still look healthy, especially for replica nodes
- Confirm CPU saturation across replicas remains below 70% at peak. Note: you should ignore the primary and backup nodes, as they aren't relevant
- Silence alerts for Patroni going down on this node (a command-line option is sketched below)
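The silence can be created through the Alertmanager UI; an amtool sketch follows, where the Alertmanager URL and the `fqdn` matcher label are assumptions about the local setup:

```shell
# Assumed Alertmanager URL and matcher label; adjust to the real setup.
amtool silence add \
  --alertmanager.url="https://alerts.gitlab.net" \
  --author="$(whoami)" \
  --comment="Scaling down patroni-ci, production issue 7524" \
  --duration="24h" \
  fqdn="patroni-ci-2004-06-db-gstg.c.gitlab-staging-1.internal"
```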
- Create a merge request to remove this replica node from patroni-ci-2004. MR: https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/4223
- Check out the branch for this merge request locally
- Run a plan to make sure the changes look right: `tf plan` (a sanity check on the plan summary is sketched below)
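Before merging, it's worth confirming the plan summary matches expectations (a sketch, assuming `tf` is the repo's terraform wrapper):

```shell
# The destroy count should correspond to the one node's resources only;
# anything unexpected here is a reason to stop before merging.
tf plan | grep -E '^Plan:'
# e.g. "Plan: 0 to add, 0 to change, 1 to destroy."
# (exact counts depend on how many resources make up a node)
```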
- Merge this merge request to decrease the count for patroni-ci-2004
- Check out `master` locally and `git pull` the merged change
- Run a plan again to make sure the changes look right: `tf plan`
- Apply the changes: `tf apply`
- Wait until the node gets removed and torn down. Validate it in GCP (one way is sketched below)
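A command-line validation sketch (the project ID comes from the node's FQDN; the filter pattern is an assumption):

```shell
# The removed node should no longer appear in the instance list.
gcloud compute instances list \
  --project=gitlab-staging-1 \
  --filter="name~'patroni-ci-2004'"
```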
- Set label ~"change::complete": `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to roll back this change
To re-add a node to the replica pool if it's not shut down yet
Estimated Time to Complete (mins) - 10
- Re-enable chef on the node: `sudo chef-client-enable`
- Re-run chef: `sudo chef-client`
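Assuming the chef run restores the original patroni.yml (dropping the tags) and Patroni reloads it, the node should reappear in the replica pool; the drain-step check can be reused to confirm:

```shell
# The node's name should show up again once it is serving reads.
dig +short @127.0.0.1 -p 8600 ci-db-replica.service.consul. SRV \
  | grep patroni-ci-2004-06 \
  && echo "node is back in the replica DNS pool"
```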
- Set label ~"change::aborted": `/label ~change::aborted`
To re-create a node in the replica pool if it's already been deleted
- Follow the runbook https://gitlab.com/gitlab-com/runbooks/-/blob/master/docs/patroni/scale-up-patroni.md
Monitoring
Key metrics to observe
- Metric: Patroni CI
- Location: https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gstg
- What changes to this metric should prompt a rollback: saturation increases, errors, or an SLI dip, particularly related to replicas
- Metric: Patroni Node - CPU Utilization
- Location: https://thanos-query.ops.gitlab.net/graph?g0.expr=avg(instance%3Anode_cpu_utilization%3Aratio%7Benv%3D%22gstg%22%2Cenvironment%3D%22gstg%22%2Ctype%3D%22patroni-ci%22%7D)%20by%20(fqdn)&g0.tab=0&g0.stacked=0&g0.range_input=6h&g0.max_source_resolution=0s&g0.deduplicate=1&g0.partial_response=0&g0.store_matches=%5B%5D
- What changes to this metric should prompt a rollback: CPU saturation peaks exceeding 70% frequently. Note you should hide/ignore the primary and backup nodes as they aren't relevant.
Rollback Thresholds
- Metric: Replica nodes CPU load (processes per core)
- Location: `node_load1`
- What changes to this metric should prompt a rollback: CPU load avg > 0.7 (per core) for 15 minutes or more
- Metric: Replica nodes CPU usage (% of all CPUs)
- Location: `node_cpu_utilization`
- What changes to this metric should prompt a rollback: avg CPU utilization > 70% for 15 minutes or more
- Metric: Replica nodes memory thrashing (swap in/out)
- Location: `node_vmstat_pswpin`, `node_vmstat_pswpout`
- What changes to this metric should prompt a rollback: spikes of swapping activity > 0 for 5 minutes or more
- Metric: Replica nodes I/O wait
- Location: `node_disk_read_time_seconds_total`, `node_disk_write_time_seconds_total`
- What changes to this metric should prompt a rollback: avg I/O wait > 10ms (or 0.01s) for 2 minutes or more, but only if caused by intense I/O activity
- Metric: Replica nodes I/O throughput in MB/s
- Location: `node_disk_read_bytes_total`, `node_disk_written_bytes_total` for `/dev/sdb`
- What changes to this metric should prompt a rollback: I/O throughput > 560 MB/s (70% of the 800 MB/s limit*) for 15 minutes or more
- Metric: Replica nodes IOPS
- Location: `node_disk_reads_completed_total`, `node_disk_writes_completed_total` for `/dev/sdb`
- What changes to this metric should prompt a rollback: IOPS > 10,500 (70% of the 15,000 IOPS limit*) for 15 minutes or more
- Metric: Writer nodes network throughput
- Location: `node_network_receive_bytes_total`, `node_network_transmit_bytes_total`
- What changes to this metric should prompt a rollback: sustained network throughput > 11.2 Gbps (1.4 GB/s), 70% of the 16 Gbps (2 GB/s) VM limit*, for 15 minutes or more
* Network and storage I/O performance limits in GSTG are based on an SSD (performance) persistent disk of 2.5 TB and an n1-standard-8 VM with 8 vCPUs, where the I/O bottleneck is the 8-vCPU N1 machine type limit for pd-performance, not the block device limits.
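These thresholds can also be spot-checked against the Thanos HTTP API rather than the graph UI (a sketch for the CPU metric; the recording rule name is taken from the dashboard link above, and the jq formatting is just a convenience):

```shell
# Per-replica CPU utilization; any fqdn sustaining > 0.7 for 15 minutes
# is a rollback signal per the thresholds above.
curl -sG 'https://thanos-query.ops.gitlab.net/api/v1/query' \
  --data-urlencode 'query=avg by (fqdn) (instance:node_cpu_utilization:ratio{env="gstg",type="patroni-ci"})' \
  | jq -r '.data.result[] | "\(.metric.fqdn) \(.value[1])"'
```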
Change Reviewer checklist
- Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The labels `blocks deployments` and/or `blocks feature-flags` are applied as necessary
Change Technician checklist
- Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the `eoc_approved` label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are `severity::1` or `severity::2`
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.