Check for excessive puma restarts

Adding a new check

How many puma restarts are too many? That is the biggest open question here, so I've gone with a starting threshold of 2 restarts per minute. I think we could actually go lower, to 1 restart per minute, so this is a somewhat conservative start.

I do the calculation from the current puma_stdout.log. I take the timestamps on the first and last lines, convert them to seconds, and subtract to get the duration the log file covers, then divide by 60 to get minutes instead of seconds. Then I count how many times "Sending TERM" appears in that same log file and divide that count by the number of minutes. Since this is bash integer math, the result is truncated, effectively throwing away anything past the decimal point. Since we're hand-waving here already, I think that's a reasonable approach.
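As a rough illustration of the arithmetic described above, here is a minimal sketch. It is not the script from this MR. The function names `restarts_per_minute` and `exceeds_threshold` are hypothetical, and the sketch assumes GNU `date` plus a log whose first whitespace-separated field on each line is a timestamp that `date -d` can parse. The guard against a log shorter than one minute and the "greater than or equal" comparison are also assumptions, not something this description specifies.

```shell
# Hypothetical helper, not the check script from this MR.
# Assumes GNU date and that the first whitespace-separated field of each
# log line is a timestamp `date -d` can parse (for example ISO 8601).
# The body runs in a subshell so its variables do not leak to the caller.
restarts_per_minute() (
  log="$1"

  # Timestamps of the first and last lines, converted to epoch seconds.
  first_ts=$(head -n 1 "$log" | awk '{print $1}')
  last_ts=$(tail -n 1 "$log" | awk '{print $1}')
  start=$(date -d "$first_ts" +%s) || exit 1
  end=$(date -d "$last_ts" +%s) || exit 1

  # Duration of the log in whole minutes (integer division truncates).
  minutes=$(( (end - start) / 60 ))

  # Assumption: treat very short logs as one minute to avoid dividing by zero.
  if [ "$minutes" -lt 1 ]; then
    minutes=1
  fi

  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  restarts=$(grep -c "Sending TERM" "$log" || true)

  # Integer division again: anything past the decimal point is dropped.
  echo $(( restarts / minutes ))
)

# Succeeds when the rate reaches the threshold (default 2 per minute).
# Assumption: "too many" is taken to mean "greater than or equal to".
exceeds_threshold() {
  [ "$(restarts_per_minute "$1")" -ge "${2:-2}" ]
}
```

With these assumptions, a log spanning 2 minutes that contains 5 "Sending TERM" lines yields 5 / 2 = 2 after truncation, which meets the default threshold of 2.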

Closes #165

Verification steps for review

I took a set of test data from a customer's system where we recommended tuning puma memory limits. With that puma_stdout.log on my test system I get this response when running this check:

spot --ssh-agent -v -p all_playbook.yml -u root -k ~/.ssh/id_rsa -e GITLAB_VERSION:17.3.5 -n "20251107_165 - run check"
spot v1.19.1-74e1afa-2025-08-29T17:32:47Z
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run task "20251107_165 - run check", commands: 1
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run command "Run the check and store result"
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22]  > sudo /bin/sh -c /tmp/.spot-2780404833747254272/spot-script3340370579
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22]  > setvar JSON_DATA_20251107_165_all_diana={ "ref_url": "https://gitlab.com/gitlab-com/support/toolbox/gitlab-detective/-/issues/165", 
  "title": "Check for high puma restart frequency", "host": "interview-instance.env-57c7bd43.gcp.gitlabsandbox.net", 
  "workaround_url": "https://docs.gitlab.com/administration/operations/puma/#reducing-memory-use", "version_started": "14.0.0", 
  "version_fixed": null, "message": "Your system is showing signs of frequent puma worker restarts due to hitting configured memory 
  limits. This can result in perceived performance degradation on the system. You can alleviate this problem by tuning the puma 
  max memory configuration." }
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed command "Run the check and store result" {script: /bin/sh -c [multiline script]} (3.501s)
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed task "20251107_165 - run check", commands: 1 (4.444s)

With the boring little-to-no-activity log on my test system, I get this response:

spot --ssh-agent -v -p all_playbook.yml -u root -k ~/.ssh/id_rsa -e GITLAB_VERSION:17.3.5 -n "20251107_165 - run check"
spot v1.19.1-74e1afa-2025-08-29T17:32:47Z
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run task "20251107_165 - run check", commands: 1
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] run command "Run the check and store result"
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22]  > sudo /bin/sh -c /tmp/.spot-6044029276355193856/spot-script2584336099
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22]  > setvar JSON_DATA_20251107_165_all_diana={ "ref_url": "https://gitlab.com/gitlab-com/support/toolbox/gitlab-detective/-/issues/165", 
  "title": "Check for high puma restart frequency", "host": "interview-instance.env-57c7bd43.gcp.gitlabsandbox.net", 
  "workaround_url": "https://docs.gitlab.com/administration/operations/puma/#reducing-memory-use", "version_started": "14.0.0", 
  "version_fixed": null }
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed command "Run the check and store result" {script: /bin/sh -c [multiline script]} (3.451s)
[diana interview-instance.env-57c7bd43.gcp.gitlabsandbox.net:22] completed task "20251107_165 - run check", commands: 1 (4.126s)

Author checklist

  • After opening the MR:
    • Set it to the current milestone
    • Ask the Maintainer from the Reviewer roulette suggestion for review

Reviewer checklist

  • I followed the verification steps and confirm the functionality of the new check
    • I executed the check as presented in this MR by running the generated playbook with spot
    • In case of unexpected/odd behavior here, verify the generated playbook to account for potential YAML parsing issues
  • This check only performs read operations
  • This check does not output more than necessary on stdout for the check to function
  • The message explains what it means when this check does not pass
  • The workaround_url provides actionable information/steps for affected users
    • Consider if a Knowledge Base article should exist to serve as the ideal workaround URL
  • This check is not using the Rails console/runner, or has Maintainer approval for doing so
  • This check is not using a Rake task, or has Maintainer approval for doing so
  • If this is a breaking change check:
    • It has the corresponding xx_breaking_changes tag (xx being the major release version for the change)
    • The workaround_url goes to the entry on the https://docs.gitlab.com/update/deprecations/ page
    • The ref_url goes to the deprecation issue linked from that entry
    • The title is the same as that entry
    • The version_started is equal to the announcement_milestone of the deprecation
    • The version_fixed is equal to the removal_milestone of the deprecation
Edited by Diana Stanley
