Check for excessive puma restarts
Adding a new check
How many puma restarts are too many? That is the biggest open question here, so I've gone with a starting threshold of 2 restarts per minute. I think we could actually go lower, to 1 restart per minute, so this is a somewhat conservative start.
I do the calculation by taking the timestamps from the first and last lines of the current puma_stdout.log, converting them to seconds, and dividing the difference by 60 to get the duration of the log file in minutes. Then I grep for how many times "Sending TERM" appears in that same log file and divide that count by the number of minutes. Since this is bash math, the result is an integer, which effectively throws away anything past the decimal point. Since we're hand-waving here already, I think that's a reasonable approach.
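For reviewers who want to reason about the arithmetic, here is a minimal sketch of the calculation described above. It is not the check script from this MR. It assumes each log line is JSON with a "timestamp" field in ISO 8601 format and that GNU date is available. The helper names line_epoch and restarts_per_minute are hypothetical, and the one-minute floor is my own guard against dividing by zero.

```shell
#!/usr/bin/env bash
# Sketch only. Assumptions: each log line is JSON containing
# "timestamp":"<ISO 8601>", and GNU date (date -d) is available.

# Hypothetical helper: extract the timestamp from one log line
# and convert it to epoch seconds.
line_epoch() {
  local ts
  ts=$(printf '%s\n' "$1" | sed -n 's/.*"timestamp":"\([^"]*\)".*/\1/p')
  date -u -d "$ts" +%s
}

# Hypothetical helper: integer restarts per minute for one log file.
restarts_per_minute() {
  local log="$1" first last minutes restarts
  first=$(line_epoch "$(head -n 1 "$log")")
  last=$(line_epoch "$(tail -n 1 "$log")")
  # Integer division: duration of the log in whole minutes.
  minutes=$(( (last - first) / 60 ))
  # Assumed guard: treat logs shorter than one minute as one minute.
  [ "$minutes" -gt 0 ] || minutes=1
  # grep -c prints 0 (and exits non-zero) when nothing matches.
  restarts=$(grep -c 'Sending TERM' "$log" || true)
  # Integer division again: anything past the decimal is dropped.
  echo $(( restarts / minutes ))
}
```

Because of the integer division, 9 restarts over a 5 minute log comes out as 1, not 1.8, so a log like that would sit below the 2 per minute threshold.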
Closes #165
Verification steps for review
I took a set of test data from a customer's system where we recommended tuning puma memory limits. With that puma_stdout.log on my test system I get this response when running this check:
spot --ssh-agent -v -p all_playbook.yml -u root -k ~/.ssh/id_rsa -e GITLAB_VERSION:17.3.5 -n "20251107_165 - run check"
spot v1.19.1-74e1afa-2025-08-29T17:32:47Z
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run task "20251107_165 - run check", commands: 1
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run command "Run the check and store result"
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] > sudo /bin/sh -c /tmp/.spot-2780404833747254272/spot-script3340370579
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] > setvar JSON_DATA_20251107_165_all_diana={ "ref_url": "https://gitlab.com/gitlab-com/support/toolbox/gitlab-detective/-/issues/165",
"title": "Check for high puma restart frequency", "host": "interview-instance.env-57c7bd43.gcp.gitlabsandbox.net",
"workaround_url": "https://docs.gitlab.com/administration/operations/puma/#reducing-memory-use", "version_started": "14.0.0",
"version_fixed": null, "message": "Your system is showing signs of frequent puma worker restarts due to hitting configured memory
limits. This can result in perceived performance degradation on the system. You can alleviate this problem by tuning the puma
max memory configuration." }
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed command "Run the check and store result" {script: /bin/sh -c [multiline script]} (3.501s)
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed task "20251107_165 - run check", commands: 1 (4.444s)
With the boring little-to-no-activity log on my test system, I get this response:
spot --ssh-agent -v -p all_playbook.yml -u root -k ~/.ssh/id_rsa -e GITLAB_VERSION:17.3.5 -n "20251107_165 - run check"
spot v1.19.1-74e1afa-2025-08-29T17:32:47Z
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run task "20251107_165 - run check", commands: 1
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run command "Run the check and store result"
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] > sudo /bin/sh -c /tmp/.spot-6044029276355193856/spot-script2584336099
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] > setvar JSON_DATA_20251107_165_all_diana={ "ref_url": "https://gitlab.com/gitlab-com/support/toolbox/gitlab-detective/-/issues/165",
"title": "Check for high puma restart frequency", "host": "interview-instance.env-57c7bd43.gcp.gitlabsandbox.net",
"workaround_url": "https://docs.gitlab.com/administration/operations/puma/#reducing-memory-use", "version_started": "14.0.0",
"version_fixed": null }
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed command "Run the check and store result" {script: /bin/sh -c [multiline script]} (3.451s)
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed task "20251107_165 - run check", commands: 1 (4.126s)
Author checklist
- After opening the MR:
  - Set it to the current milestone
  - Ask the Maintainer from the Reviewer roulette suggestion for review
Reviewer checklist
- I followed the verification steps and confirm the functionality of the new check
- I executed the check as presented in this MR by running the generated playbook with spot
  - In case of unexpected/odd behavior here, verify the generated playbook to account for potential YAML parsing issues
- This check only performs read operations
- This check does not output more than necessary on stdout for the check to function
- The message explains what it means when this check does not pass
- The workaround_url provides actionable information/steps for affected users
  - Consider if a Knowledge Base article should exist to serve as the ideal workaround URL
- This check is not using the Rails console/runner, or has Maintainer approval for doing so
- This check is not using a Rake task, or has Maintainer approval for doing so
- If this is a breaking change check:
  - It has the corresponding xx_breaking_changes tag (xx being the major release version for the change)
  - The workaround_url goes to the entry on the https://docs.gitlab.com/update/deprecations/ page
  - The ref_url goes to the deprecation issue linked from that entry
  - The title is the same as that entry
  - The version_started is equal to the announcement_milestone of the deprecation
  - The version_fixed is equal to the removal_milestone of the deprecation