Enable the `load_balancer_low_statement_timeout` feature flag
## Production Change

### Change Summary
Enable the `load_balancer_low_statement_timeout` feature flag in production. Feature flag rollout issue: gitlab-org/gitlab#473429

If enabled, this feature flag helps guard against failing disks on database replicas, such as the failure that caused 2024-09-10: Increased errors on GitLab.com (#18535 - closed).

There is currently a hard PCL (#18551 - closed), but given that this change could help if another disk fails, I'm proposing it as a possible change to execute during the PCL. It was decided to wait until the PCL is over on Monday to execute this.
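For context on why a low statement timeout helps here: roughly, the Rails load balancer issues queries against each replica to decide whether it is usable, and a replica on a failing disk can leave those queries hanging rather than failing fast. A low statement timeout turns that hang into a quick error, so the replica can be excluded. The sketch below is only an illustration of that idea (Python with psycopg2, a made-up DSN, and an arbitrary 100 ms timeout), not the actual implementation behind the flag.

```python
# Illustration only, not GitLab's implementation: a low per-session
# statement_timeout bounds how long a check query can hang on a replica
# whose disk is failing. The DSN and the 100 ms value are placeholders.
import psycopg2


def replica_responds(dsn: str, timeout_ms: int = 100) -> bool:
    """Return True if the replica answers a trivial query within timeout_ms."""
    try:
        conn = psycopg2.connect(dsn, options=f"-c statement_timeout={timeout_ms}")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")  # would otherwise block on a failing disk
                cur.fetchone()
            return True
        finally:
            conn.close()
    except psycopg2.Error:
        # Timed out or unreachable: a load balancer would take this replica
        # out of rotation rather than letting application queries pile up.
        return False
```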
### Change Details

- **Services Impacted** - Rails web and Sidekiq; main and ci database replicas
- **Change Technician** - @stomlinson
- **Change Reviewer** - @mattkasa
- **Time tracking** - 15 minutes
- **Downtime Component** - No downtime
### Detailed steps for the change

#### Change Steps - steps to take to execute the change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

- [ ] Set label ~"change::in-progress" `/label ~change::in-progress`
- [ ] Enable the `load_balancer_low_statement_timeout` feature flag: `/chatops run feature set load_balancer_low_statement_timeout true`
- [ ] Monitor the change for 15 minutes (see monitoring section)
- [ ] Set label ~"change::complete" `/label ~change::complete`
#### Rollback

Rollback steps - steps to be taken in the event of a need to rollback this change

Estimated Time to Complete (mins) - Estimated Time to Complete in Minutes

- [ ] Disable the `load_balancer_low_statement_timeout` feature flag: `/chatops run feature set load_balancer_low_statement_timeout false`
- [ ] Set label ~"change::aborted" `/label ~change::aborted`
#### Monitoring

**Key metrics to observe**

We can monitor the load balancing logs at https://log.gprd.gitlab.net/app/r/s/81nVV (excluding the normal `host_list_update` and `service_discovery_failure` messages, which happen every few minutes and when new pods boot before Consul is ready, respectively).

We should also look for any abrupt shift in traffic from the replicas to the primary, for either the main or ci database. We can monitor that from the Patroni overview dashboards for main and ci (linked in the metrics below). Any abrupt shift means we should turn off the flag.
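As a rough aid for eyeballing the log volume, the sketch below counts load balancer messages after dropping the two expected event types named above. It assumes a JSON-lines export of the Kibana search and that the event name lives in a field called `event`; adjust to the real log schema.

```python
# Hedged sketch: count "interesting" load balancer log entries from a JSON-lines
# export of the Kibana search above. Assumes each line is a JSON object with an
# "event" field; the field name is an assumption, not the confirmed schema.
import json
import sys

EXPECTED_EVENTS = {"host_list_update", "service_discovery_failure"}


def count_unexpected(path: str) -> int:
    count = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            entry = json.loads(line)
            if entry.get("event") not in EXPECTED_EVENTS:
                count += 1
    return count


if __name__ == "__main__":
    n = count_unexpected(sys.argv[1])
    # Per the rollback guidance below: a single exclusion message is normal noise,
    # but two or more (especially a spike of dozens) should prompt a rollback.
    print(f"{n} unexpected load balancer messages")
```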
- Metric: Load balancer log messages
  - Location: https://log.gprd.gitlab.net/app/r/s/81nVV
  - What changes to this metric should prompt a rollback: A large number of messages indicating that replicas were excluded from load balancing. A single message is OK - we see roughly one per day where a replica gets excluded, usually due to transient network issues - but two or more, and especially a spike of dozens or more messages, should prompt a rollback.
- Metric: Proportion of traffic going to replicas vs primary for the main database
  - Location: `rails_primary_sql` SLI RPS and `rails_replica_sql` SLI RPS in https://dashboards.gitlab.net/d/patroni-main/patroni3a-overview?orgId=1
  - What should prompt a rollback: Any change outside of daily traffic fluctuation in the relative proportion of queries, especially more queries going to the primary and fewer to the replicas.
- Metric: Proportion of traffic going to replicas vs primary for the ci database
  - Location: `rails_primary_sql` SLI RPS and `rails_replica_sql` SLI RPS in https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci3a-overview?orgId=1
  - What should prompt a rollback: Any change outside of daily traffic fluctuation in the relative proportion of queries, especially more queries going to the primary and fewer to the replicas.
- Metric: PgBouncer connection counts for the primary database
  - Location: The "Active connections per node" section of "PgBouncer connection pooling information" in https://dashboards.gitlab.net/d/patroni-main/patroni3a-overview?orgId=1
  - What should prompt a rollback: More than a 5% change in the number of connections (a rough helper for this check is sketched after this list).
- Metric: PgBouncer connection counts for the ci database
  - Location: The "Active connections per node" section of "PgBouncer connection pooling information" in https://dashboards.gitlab.net/d/patroni-ci-main/patroni-ci3a-overview?orgId=1
  - What should prompt a rollback: More than a 5% change in the number of connections.
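To make the rollback thresholds concrete, here is a small illustrative helper for numbers read off the dashboards. The 5% connection threshold comes from the metrics above; the 5-percentage-point traffic-share shift is an assumption for the example, since the plan only says "any change outside of daily traffic fluctuation".

```python
# Sketch: turn dashboard readings into a rollback yes/no. Inputs are values read
# manually from the Patroni overview dashboards; the 5% PgBouncer threshold is
# from the metrics above, the 5-point traffic-share shift is an assumed example.

def connection_change_exceeds_threshold(before: float, after: float,
                                         threshold: float = 0.05) -> bool:
    """More than a `threshold` relative change in PgBouncer connection counts."""
    return abs(after - before) / before > threshold


def primary_share(primary_rps: float, replica_rps: float) -> float:
    """Fraction of SQL traffic hitting the primary."""
    return primary_rps / (primary_rps + replica_rps)


def traffic_shift_to_primary(before_share: float, after_share: float,
                             max_shift: float = 0.05) -> bool:
    """Flag an abrupt shift of traffic from replicas to the primary."""
    return (after_share - before_share) > max_shift


if __name__ == "__main__":
    # Example readings (made up): primary RPS up, replica RPS down after the change.
    before = primary_share(primary_rps=20_000, replica_rps=180_000)
    after = primary_share(primary_rps=40_000, replica_rps=160_000)
    print("rollback (traffic):", traffic_shift_to_primary(before, after))
    print("rollback (pgbouncer):", connection_change_exceeds_threshold(1000, 1080))
```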
### Change Reviewer checklist

- [ ] Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- [ ] Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
### Change Technician checklist

- [ ] Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention `@sre-oncall` and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~"eoc_approved" label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~"manager_approved" label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention `@release-managers` and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~"severity::1" or ~"severity::2".
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.