Disable NUMA-hinted foreground page migrations on redis-cache nodes
Production Change
Change Summary
Disable active NUMA balancing on the redis-cache hosts.
After upgrading the machine-type for the production redis-cache hosts, the hypervisor started advertising 2 NUMA nodes to the kernel. This implicitly activated the kernel's NUMA balancing behavior:
PRODUCTION PRIMARY-REDIS msmiley@redis-cache-02-db-gprd.c.gitlab-production.internal:~$ sysctl kernel.numa_balancing
kernel.numa_balancing = 1
The kernel's NUMA balancing feature attempts to migrate a process's pages to the NUMA node associated with the CPU socket where that process is running. (It does so by manipulating the page table entries of the process's virtual memory areas to artificially inject page faults. The next time a thread of that process accesses such a page, it triggers a page fault, allowing the kernel to observe the trend in remote page accesses and potentially perform a foreground page migration while the thread waits.)
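This fault-injection mechanism is directly observable in the kernel's vmstat counters. The sketch below is illustrative (it assumes a kernel built with CONFIG_NUMA_BALANCING; the counters are absent otherwise) and computes what fraction of hinting faults were already node-local:

```shell
# NUMA balancing counters (present only with CONFIG_NUMA_BALANCING):
#   numa_pte_updates        - PTEs marked to trigger hinting faults
#   numa_hint_faults        - hinting faults taken by running threads
#   numa_hint_faults_local  - hinting faults that were already node-local
#   numa_pages_migrated     - pages migrated as a result
grep -E '^numa_(pte_updates|hint_faults|hint_faults_local|pages_migrated) ' /proc/vmstat \
  || echo 'NUMA balancing counters not present on this kernel'

# Fraction of hinting faults that were already local to the thread's node;
# a low ratio means many remote accesses and hence more migration work.
awk '
  /^numa_hint_faults /       { total = $2 }
  /^numa_hint_faults_local / { local = $2 }
  END { if (total > 0) printf "local fault ratio: %.2f\n", local / total }
' /proc/vmstat
```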
In many cases this improves memory access latency, but in some cases the overhead is unacceptable. Because the redis-server process's main thread must stall while the kernel performs these page migrations, the CPU overhead of many migrations counts against the very limited CPU capacity of the redis main thread itself. Judging from last Friday's observations, this drives the redis main thread to CPU saturation during peak hours of the weekday workload. That overhead is not worth the benefit in this case.
We may later revisit other options (tuning NUMA migrations to scan more slowly, or switching machine-type to avoid having multiple NUMA nodes). For now we will simply disable these foreground migrations and accept the memory access latency penalty of sometimes accessing pages on the NUMA node farther from the socket running the redis main thread.
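For reference, the "scan more slowly" option would look roughly like the following. This is an untested sketch: the knob names exist as sysctls on many 4.x/5.x kernels but moved under /sys/kernel/debug/sched/ on newer kernels, and the example value is illustrative, not a recommendation:

```shell
# Current scan intervals (defaults are typically 1000ms min / 60000ms max);
# suppress errors on kernels where these sysctls do not exist.
sysctl kernel.numa_balancing_scan_period_min_ms \
       kernel.numa_balancing_scan_period_max_ms 2>/dev/null || true

# Example only: quadruple the minimum scan interval so the balancer marks
# pages for hinting faults less often, reducing migration overhead.
#sudo sysctl -w kernel.numa_balancing_scan_period_min_ms=4000
```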
I will follow this up with a chef change to persist this setting, but for today (Sunday), I will make it a runtime-only change, to prevent a regression on Monday.
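For context, the eventual persistence step (done properly via Chef) amounts to dropping a sysctl config file. A hand-rolled stopgap would look like the sketch below; the file name is an assumption, and on Chef-managed hosts such a drop-in could conflict with the cookbook, which is why I am not doing this today:

```shell
# Hedged sketch: persist the setting across reboots via sysctl.d.
# (The Chef cookbook is the real mechanism; this path/name is an assumption.)
echo 'kernel.numa_balancing = 0' | sudo tee /etc/sysctl.d/99-numa-balancing.conf
sudo sysctl --system   # re-apply all sysctl configuration files
```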
For details see the investigation issue: scalability#1889
Change Details
- Services Impacted - ~"Service::Redis"
- Change Technician - @msmiley
- Change Reviewer - @nnelson
- Time tracking - 1 minute
- Downtime Component - None
Detailed steps for the change
Change Steps - steps to take to execute the change
Estimated Time to Complete (mins) - 1 minute
1. Set label ~change::in-progress: `/label ~change::in-progress`
2. Verify the initial state is that the feature is enabled (1): `$ mussh -h redis-cache-{01..03}-db-gprd.c.gitlab-production.internal -c 'sysctl kernel.numa_balancing'`
3. For each of the 3 redis-cache VMs, disable the kernel's NUMA balancing behavior: `$ sudo sysctl -w kernel.numa_balancing=0`
4. Verify the final state is that the feature is disabled (0): `$ mussh -h redis-cache-{01..03}-db-gprd.c.gitlab-production.internal -c 'sysctl kernel.numa_balancing'`
5. Set label ~change::complete: `/label ~change::complete`
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
Estimated Time to Complete (mins) - 1 minute
1. For each of the 3 redis-cache VMs, re-enable the kernel's NUMA balancing behavior: `$ sudo sysctl -w kernel.numa_balancing=1`
2. Set label ~change::aborted: `/label ~change::aborted`
Monitoring
Key metrics to observe
- Metric: Page fault rate
- Location: thanos query
- What changes to this metric should prompt a rollback: The page fault rate should decrease after this change. The kernel's numa_balancing feature injects page faults, and the increase is visible in the above graph, starting when we switched to a machine-type with multiple NUMA nodes. If the page fault rate does not decrease, or redis latency worsens, consider rolling back.
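As a quick single-host cross-check (independent of the thanos graph), the system-wide fault rate can be sampled directly from /proc/vmstat. This is a rough sketch; note that the pgfault counter covers all page faults, not just NUMA hinting faults:

```shell
# Sample the cumulative fault counter twice and report faults/sec.
a=$(awk '/^pgfault /{print $2}' /proc/vmstat)
sleep 5
b=$(awk '/^pgfault /{print $2}' /proc/vmstat)
echo "page faults/sec: $(( (b - a) / 5 ))"
```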
Change Reviewer checklist
Check if the following applies:
- The scheduled day and time of execution of the change is appropriate.
- The change plan is technically accurate.
- The change plan includes estimated timing values based on previous testing.
- The change plan includes a viable rollback plan.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
Check if the following applies:
- The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
- The change plan includes success measures for all steps/milestones during the execution.
- The change adequately minimizes risk within the environment/service.
- The performance implications of executing the change are well-understood and documented.
- The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
- The change has a primary and secondary SRE with knowledge of the details available during the change window.
- The ~"blocks deployments" and/or ~"blocks feature-flags" labels are applied as necessary.
Change Technician checklist
Check if all items below are complete:
- The change plan is technically accurate.
- This Change Issue is linked to the appropriate Issue and/or Epic
- Change has been tested in staging and results noted in a comment on this issue.
- A dry-run has been conducted and results noted in a comment on this issue.
- The change execution window respects the Production Change Lock periods.
- For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
- For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
- For C1 and C2 change issues, the SRE on-call provided approval with the eoc_approved label on the issue.
- For C1 and C2 change issues, the Infrastructure Manager provided approval with the manager_approved label on the issue.
- Release managers have been informed (if needed; cases include DB changes) prior to the change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
- There are currently no active incidents that are severity::1 or severity::2.
- If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.