Skip to content

Fix check for DISABLE_PUMA_WORKER_KILLER environment variable

Matthias Käppler requested to merge 334831-fix-puma-killer-env-check into master

What does this MR do and why?

See #365954 (closed)

This came out of an investigation into why we weren't seeing any memory killer events in SaaS. I described the issue here: #334831 (comment 966134364)

The Puma worker killer is a gem we use to reap Puma workers in case the process cluster runs above a given memory budget. It is meant to be enabled by default, but you can disable it via the DISABLE_PUMA_WORKER_KILLER environment variable.

Copying over the bug description:

In puma.rb, we check ENV['DISABLE_PUMA_WORKER_KILLER'] for truthiness -- that is not correct since environment variables are always strings and any string, even when empty, is truthy in Ruby:

irb(main):004:0> "" || "not set"
=> ""

This is not a problem as long as we don't set this environment at all (i.e. the expression evaluates to nil, which is falsey) or set it to something truthy (1, true, foo), but as soon as we set it to something that is (supposed to be) falsey (0, false), then the unless condition will still evaluate to true, which is what triggers this bug.

(🤞 I got my double-negatives right here)

Turns out we explicitly set this to false in our GitLab chart: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/blob/3322ba5ff67d3ff4996222610cd8966e66a5ab35/charts/gitlab/gprd/charts/gitlab/charts/webservice/values.yaml#L112

Therefore the worker killer is neither running for users of our chart by default, nor for SaaS.

I fixed this by using Gitlab::Utils.to_boolean on the string first. I also removed the double-negative ("unless disabled" -> "if enabled")

How to set up and validate locally

On master you can set DISABLE_PUMA_WORKER_KILLER=0 (or "false") to trigger the bug. This ought to enable it, but it doesn't. On this branch, it now should leave it enabled.

Risk considerations

As my analysis in #334831 (comment 964366171) shows, we sometimes run twice the RSS budget compared to what should be allowed given the current worker killer thresholds. However, these are figures when working at the extremes in SaaS; I have therefore decided to do two things to de-risk this fix:

  • Disable the PWK in chart-based deployments including SaaS. This is done via #364184 (closed).
  • Increase memory killer limits. I tested these limits against our gpt performance test suite here.

Related MRs:

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #334831 (closed)

Edited by Matthias Käppler

Merge request reports