Reduce Workhorse readiness calls to upstream Puma /-/readiness endpoint

Background

In !207192 (merged), we added a separate Workhorse readiness endpoint that is responsible for checking the readiness of the upstream Puma server.

Workhorse makes its own async requests to Puma's /-/readiness and the control app server on Puma to determine how many threads/workers are running.

What does this MR do and why?

Previously, the Workhorse readiness checker periodically queried the /-/readiness endpoint, even when successful requests had recently been relayed to the Rails backend.

We can reduce these queries and increase the reliability of readiness checks by skipping this call if we have recently relayed successful requests to the Rails backend. By default this is configured to 20s via rails_skip_interval.

References

Relates to gitlab-com/gl-infra/production#20469

How to set up and validate locally

  1. In config/puma.rb, add this line:
activate_control_app 'tcp://127.0.0.1:9293', { no_token: true }
  2. In workhorse/config.toml, add this section:
[health_check_listener]
  # Network type for the health check listener (tcp, tcp4, tcp6, unix)
  network = "tcp"
  # Address to bind the health check server to
  addr = "localhost:8182"
  puma_control_url = "http://localhost:9293"
  3. Build this branch:
git checkout sh-add-workhorse-skip-interval
make -C workhorse
gdk restart gitlab-workhorse
  4. Run gdk tail gitlab-workhorse
  5. Access your GDK. Periodically run curl -s http://localhost:8182/readiness | jq. You should eventually see skipped_due_to_recent_success set to true:

{
  "checks": {
    "puma_readiness": {
      "control_duration_s": 0.001491458,
      "control_server": true,
      "control_server_last_scrape_time": "2025-10-10T04:56:44Z",
      "healthy": true,
      "readiness_duration_s": 0,
      "readiness_endpoint": true,
      "skip_interval_s": 30,
      "skipped_due_to_recent_success": true
    }
  },
  "health_thresholds": {
    "max_consecutive_failures": 1,
    "min_successful_probes": 1
  },
  "metrics": {
    "consecutive_failures": 0,
    "consecutive_successes": 4
  },
  "ready": true
}
  6. If you stop accessing your GDK, skipped_due_to_recent_success should eventually go back to false:
{
  "checks": {
    "puma_readiness": {
      "control_duration_s": 0.006726667,
      "control_server": true,
      "control_server_last_scrape_time": "2025-10-10T04:57:14Z",
      "healthy": true,
      "readiness_duration_s": 0.035112667,
      "readiness_endpoint": true,
      "readiness_last_scrape_time": "2025-10-10T04:57:14Z",
      "skipped_due_to_recent_success": false
    }
  },
  "health_thresholds": {
    "max_consecutive_failures": 1,
    "min_successful_probes": 1
  },
  "metrics": {
    "consecutive_failures": 0,
    "consecutive_successes": 7
  },
  "ready": true
}
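
To check these responses from a script rather than by eye, the JSON can be decoded along these lines. This is a sketch only: the struct and function names are hypothetical, the field names are taken from the sample output above, and it decodes an embedded sample instead of calling the listener.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pumaReadiness holds the subset of fields of interest from the
// "puma_readiness" check. Struct names are hypothetical; the JSON keys
// come from the sample output above.
type pumaReadiness struct {
	Healthy                   bool `json:"healthy"`
	ReadinessEndpoint         bool `json:"readiness_endpoint"`
	SkippedDueToRecentSuccess bool `json:"skipped_due_to_recent_success"`
}

// readinessResponse mirrors the top-level shape of the /readiness output.
type readinessResponse struct {
	Checks struct {
		PumaReadiness pumaReadiness `json:"puma_readiness"`
	} `json:"checks"`
	Ready bool `json:"ready"`
}

// parseReadiness decodes a /readiness response body.
func parseReadiness(body []byte) (readinessResponse, error) {
	var r readinessResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	// Trimmed copy of the first sample response above.
	sample := []byte(`{
  "checks": {
    "puma_readiness": {
      "healthy": true,
      "readiness_endpoint": true,
      "skipped_due_to_recent_success": true
    }
  },
  "ready": true
}`)

	r, err := parseReadiness(sample)
	if err != nil {
		fmt.Println("decode error:", err)
		return
	}
	fmt.Printf("ready=%t skipped=%t\n", r.Ready, r.Checks.PumaReadiness.SkippedDueToRecentSuccess)
}
```

For the embedded sample this prints ready=true skipped=true. Unknown keys such as the timing fields are ignored by json.Unmarshal, so the same structs accept both responses shown above.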

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Dylan Griffith
