Roll out the `LOAD_BALANCER_PARALLEL_DISCONNECT` environment variable in staging, and test by removing a staging database replica
**Staging Change**

## Change Summary
With "Fix service discovery refresh-hosts delay" (gitlab-org/gitlab!130432, merged) we aim to fix "Reduce time for service discovery to update aft..." (gitlab-org/gitlab#423382, closed).

This fix should reduce the time it takes for traffic to removed replicas to stop, which will speed up the scheduled PG 14 upgrade; the last time, it took 40 minutes for the traffic to drain.
As this is a change to our load-balancing system with the potential to impact many services, we have put it behind an environment variable so that we can roll it out incrementally.

We did not use a feature flag because this is a low-level code change, and the code that checks feature flags could itself be affected.
Testing this in staging requires changing our list of database load balancing hosts in staging.
Using an environment variable, combined with testing this change in staging first, gives a low-risk way to verify that the code change performs as expected without impacting the production environment.
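The shape of such a gate is sketched below. This is hypothetical: the variable name comes from this CR's title, but the surrounding values structure is illustrative and is not the actual contents of gitlab-com/gl-infra/k8s-workloads/gitlab-com!3017:

```yaml
# Illustrative only -- the real change lives in
# gitlab-com/gl-infra/k8s-workloads/gitlab-com!3017.
# Setting the variable in the gstg deployment enables the new
# parallel-disconnect code path for that environment only.
extraEnv:
  LOAD_BALANCER_PARALLEL_DISCONNECT: "true"
```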
## Change Details
- Services Impacted - ~"Service::Web" ~"Service::API" ~"Service::GitLab Rails"
- Change Technician - @krasio @DylanGriffith
- Change Reviewer - @ahanselka @stomlinson
- Time tracking - 1 hour
- Downtime Component - none
## Detailed steps for the change

### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - 40 minutes (20 minutes for deploys + 10 minutes for each of two load-balancer change events).
- [ ] Set label ~change::in-progress: `/label ~change::in-progress`
- [ ] Apply to staging:
  - [ ] Merge the MR that adds the environment variable to gstg: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3017 (merged).
  - [ ] Wait for the deployment to complete.
- [ ] Remove a replica database from load balancing:
  - [ ] Check the current list of Patroni replicas without the `nofailover` and `noloadbalance` tags set to `true`: `ssh patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal "sudo gitlab-patronictl list"`
  - [ ] Log in to the `patroni-main-v14-104-db-gstg.c.gitlab-staging-1.internal` Patroni replica and perform the following steps:
    - [ ] Disable Chef so the node stays out of the load balancer: `sudo chef-client-disable "Remove a replica from load balancer - CR https://gitlab.com/gitlab-com/gl-infra/production/-/issues/16300"`
    - [ ] Add a `tags` section to `/var/opt/gitlab/patroni/patroni.yml`: `tags: nofailover: true noloadbalance: true`
    - [ ] Reload the Patroni config: `sudo systemctl reload patroni`
    - [ ] Confirm the above replica has the `nofailover` and `noloadbalance` tags set to `true`: `sudo gitlab-patronictl list`
    - [ ] Verify the reload took effect by checking for the node name in the list of replicas; if the name is absent, the reload worked: `dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short | awk '{print $4}' | sort | uniq -i`
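Spelled out, the `tags` section added to `/var/opt/gitlab/patroni/patroni.yml` in the step above is:

```yaml
# Keeps this replica out of failover candidacy and out of
# database load balancing until the tags are removed again.
tags:
  nofailover: true
  noloadbalance: true
```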
- [ ] Verify that traffic drains from the removed replica in <= 10 minutes (2 minutes of disconnect time + 10 s × 6 replicas of force-disconnect time + 5 minutes for the DNS change to propagate + 2 minutes for service discovery to pick up the DNS change): https://dashboards.gitlab.net/goto/eOhFwLkSg?orgId=1 should show the transactions per second of the removed host dropping to zero.
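The 10-minute bound above breaks down as follows (a sketch of the arithmetic, using the component times listed in the step; 6 is the staging replica count):

```shell
# Worst-case drain-time budget, all durations in seconds.
gentle_disconnect=$((2 * 60))   # disconnect time
force_disconnect=$((10 * 6))    # up to 10 s of force disconnect per replica, 6 replicas
dns_propagation=$((5 * 60))     # DNS change propagation
sd_refresh=$((2 * 60))          # service discovery picking up the DNS change
total=$((gentle_disconnect + force_disconnect + dns_propagation + sd_refresh))
echo "worst-case drain: ${total}s ($((total / 60)) minutes)"
```

If traffic has not drained after this budget elapses, something is wrong and the rollback steps below apply.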
- [ ] Add the replica back for database load balancing:
  - [ ] Log in to the `patroni-main-v14-104-db-gstg.c.gitlab-staging-1.internal` Patroni replica and perform the following steps:
    - [ ] Enable Chef: `sudo chef-client-enable`
    - [ ] Re-run Chef on the node: `sudo chef-client`
    - [ ] Reload the Patroni config: `sudo systemctl reload patroni`
    - [ ] Confirm the above replica no longer has the `nofailover` and `noloadbalance` tags set to `true`: `sudo gitlab-patronictl list`
    - [ ] Verify the reload took effect by checking for the node name in the list of replicas; if the name is present, the reload worked: `dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short | awk '{print $4}' | sort | uniq -i`
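The presence check on the `dig` output can be scripted. Below is a minimal sketch: the `node_in_replicas` helper is a hypothetical name, and the sample hostnames stand in for the real pipeline output (in practice the input comes from the `dig ... | awk '{print $4}'` pipeline above):

```shell
# Succeeds if the node name ($1) appears in the SRV target list read from stdin.
node_in_replicas() {
  grep -qi "^$1"
}

# Sample SRV targets, as produced by: dig ... SRV +short | awk '{print $4}'
targets="patroni-main-v14-101-db-gstg.c.gitlab-staging-1.internal.
patroni-main-v14-102-db-gstg.c.gitlab-staging-1.internal."

printf '%s\n' "$targets" | node_in_replicas patroni-main-v14-101 \
  && echo "present: replica is in load balancing" \
  || echo "absent: reload has not taken effect yet"
```

The same check inverted (expecting absence) applies to the removal step earlier in this plan.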
- [ ] Verify that traffic returns to the re-added replica in <= 10 minutes: https://dashboards.gitlab.net/goto/eOhFwLkSg?orgId=1 should show the transactions per second of the re-added host increasing back to its previous value.
- [ ] Set label ~change::complete: `/label ~change::complete`
## Rollback

### Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - 15 minutes (5 minutes to propagate the DNS change, 5 minutes to merge the revert MR, 5 minutes to restart all web and Sidekiq processes).
- [ ] Add the removed database host back, if it was removed:
  - [ ] Log in to the `patroni-main-v14-104-db-gstg.c.gitlab-staging-1.internal` Patroni replica and perform the following steps:
    - [ ] Enable Chef: `sudo chef-client-enable`
    - [ ] Re-run Chef on the node: `sudo chef-client`
    - [ ] Reload the Patroni config: `sudo systemctl reload patroni`
    - [ ] Confirm the above replica no longer has the `nofailover` and `noloadbalance` tags set to `true`: `sudo gitlab-patronictl list`
    - [ ] Verify the reload took effect by checking for the node name in the list of replicas; if the name is present, the reload worked: `dig @127.0.0.1 -p 8600 db-replica.service.consul. SRV +short | awk '{print $4}' | sort | uniq -i`
- [ ] Revert the MR that added the environment variable to gstg: gitlab-com/gl-infra/k8s-workloads/gitlab-com!3017 (merged).
- [ ] Restart web and Sidekiq processes, both to pick up the change to the environment variable and to discover the re-added database host.
- [ ] Verify that the transactions per second of the re-added host increase back to their previous value.
- [ ] Set label ~change::aborted: `/label ~change::aborted`
## Monitoring

### How to know if it worked

#### Traffic on the removed replica should drop close to zero once removed
We can watch traffic (transactions per second) to each staging database replica via https://dashboards.gitlab.net/goto/eOhFwLkSg?orgId=1. We should see it drop close to zero for the removed replica during the change.
We won't see traffic drop all the way to zero, because other processes, such as postgres-exporter, will continue to send a small number of queries to the database.
#### host_list_update events

We currently log `host_list_update` events. This should continue to happen as before.

- `host_list_update` logs for staging
#### host_list_disconnection events

With gitlab-org/gitlab!130432 (merged) we will also start logging `host_list_disconnection` details, which should appear right after `host_list_update` events. The `total_disconnect_duration_s` for these should be no more than 3 minutes (2 minutes of `gentle_disconnect_duration_s` + up to 10 s per database host, across 6 hosts, of `force_disconnect_duration_s`).
- `host_list_disconnection` logs for staging
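The 3-minute bound on `total_disconnect_duration_s` comes from the following sums (a sketch, assuming the 6 staging database hosts mentioned above):

```shell
# Upper bound on total_disconnect_duration_s, per the log fields above.
gentle=$((2 * 60))    # gentle_disconnect_duration_s window
force=$((10 * 6))     # force_disconnect_duration_s: up to 10 s x 6 hosts
max_total=$((gentle + force))
echo "max total_disconnect_duration_s: ${max_total}"
```

Values consistently above this bound in the staging logs would suggest the parallel disconnect is not behaving as intended.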
### Key metrics to observe

- Metric: Number of connections from Rails to PgBouncer
  - Location:
  - What changes to this metric should prompt a rollback: a sudden, unexpected change in total connections (for example, a drop to 0 could indicate that service discovery is broken).
- Metric: `rails_primary_sql` SLI Apdex
  - Location: https://dashboards.gitlab.net/d/patroni-main/patroni-overview?orgId=1&viewPanel=2409561530&from=now-6h&to=now
  - What changes to this metric should prompt a rollback: any drop in Apdex.
- Metric: PostgreSQL Overview dashboard
  - Location: https://dashboards.gitlab.net/d/000000144/postgresql-overview?orgId=1&from=now-6h&to=now
  - What changes to this metric should prompt a rollback: any deviation from the normal state, aside from the expected reduction of traffic to the removed host.
## Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
## Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - The change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity1 or ~severity2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.