Disable Puma worker killer in resource managed environments
We run a piece of Ruby code called the puma worker killer in GitLab. It is a background thread that reaps worker processes if they run over a given RSS budget, or if some timer expires. We recently realized that this wasn't running in SaaS (or any Helm based deployment) due to a configuration bug (it was intended to run): #334831 (closed)
The question came up: does it even make sense to run this in a resource-controlled environment like k8s, where we define container and pod resource budgets anyway through k8s? Should we just be turning this off? I believe it dates from a time when SaaS was being deployed from Omnibus and running on VMs, where there was no supervision in place to control resource use.
From the feedback so far, the pros and cons are:
Pros:
- Dogfooding and behavioral parity with unmanaged envs like from-source or Omnibus
- Prevent situations where the container is being killed without warning due to load spikes (see comments in #364184 (comment 969403853))
Cons:
- Must be kept aligned with container resource budgets
- More configuration to manage
- It calculates memory use based on RSS, which is ill-suited for pre-fork servers where memory is shared between processes
- Can hide memory issues, since PWK regularly rolls over workers
Rollout
I decided to move ahead with this issue since there isn't enough evidence for PWK being useful enough in chart-based deployments in its current state (which does not mean we may not revive this in some other form), and it blocks !88928 (merged).
The rollout needs to be carefully orchestrated so that we do not accidentally enable the PWK in production with insufficient memory thresholds. The order I think should be:
Disable PWK in chart-based deployments
-
Disable PWK in the GitLab chart: gitlab-org/charts/gitlab!2605 (merged) -
Disable PWK in SaaS - Disable it in pre-production: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1850 (merged)
- Disable it in production: gitlab-com/gl-infra/k8s-workloads/gitlab-com!1854 (merged)
Fix bug that caused it to be disabled even when it was enabled
This was broken out into #365954 (closed)
This should put us in a safer position to iterate on PWK or a potential replacement.