Reduce Workhorse readiness calls to upstream Puma /-/readiness endpoint
Background
In !207192 (merged), we added a separate Workhorse readiness endpoint that is responsible for checking the readiness of the downstream Puma server.
Workhorse makes its own asynchronous requests to Puma's /-/readiness endpoint and to the control app server on Puma to determine how many threads/workers are running.
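To illustrate the kind of data the control app exposes, here is a minimal Go sketch that decodes a trimmed, cluster-mode response from Puma's control app /stats endpoint and counts booted workers and running threads. Which endpoint and fields Workhorse actually reads is an assumption here, and the pumaStats, workerStatus and summarize names are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workerStatus mirrors one entry of "worker_status" in Puma's cluster-mode
// /stats output. Only the fields used below are decoded (assumption).
type workerStatus struct {
	Booted     bool `json:"booted"`
	LastStatus struct {
		Running int `json:"running"` // threads currently spawned by this worker
	} `json:"last_status"`
}

// pumaStats is a hypothetical subset of the control app /stats response.
type pumaStats struct {
	Workers      int            `json:"workers"`
	WorkerStatus []workerStatus `json:"worker_status"`
}

// summarize counts booted workers and sums the running threads of those
// booted workers.
func summarize(s pumaStats) (booted, runningThreads int) {
	for _, w := range s.WorkerStatus {
		if !w.Booted {
			continue
		}
		booted++
		runningThreads += w.LastStatus.Running
	}
	return booted, runningThreads
}

func main() {
	// Sample payload shaped like (trimmed) cluster-mode /stats output.
	sample := []byte(`{"workers":2,"worker_status":[
		{"booted":true,"last_status":{"running":5}},
		{"booted":true,"last_status":{"running":4}}]}`)

	var s pumaStats
	if err := json.Unmarshal(sample, &s); err != nil {
		panic(err)
	}
	booted, running := summarize(s)
	fmt.Printf("workers=%d booted=%d running_threads=%d\n", s.Workers, booted, running)
	// prints "workers=2 booted=2 running_threads=9"
}
```

Against a live GDK, the same payload would come from an HTTP GET to http://127.0.0.1:9293/stats, the control app address configured in config/puma.rb during local setup.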
What does this MR do and why?
Previously, the Workhorse readiness checker periodically queried the /-/readiness endpoint, even if successful requests had recently been relayed to the Rails backend.
We can reduce these queries and increase the reliability of readiness checks by skipping this call when successful requests have recently been relayed to the Rails backend. This window is configured via rails_skip_interval and defaults to 20s.
References
Relates to gitlab-com/gl-infra/production#20469
How to set up and validate locally
- In config/puma.rb, add this line:
activate_control_app 'tcp://127.0.0.1:9293', { no_token: true }
- In workhorse/config.toml, add this section:
[health_check_listener]
# Network type for the health check listener (tcp, tcp4, tcp6, unix)
network = "tcp"
# Address to bind the health check server to
addr = "localhost:8182"
puma_control_url = "http://localhost:9293"
- Build this branch:
git checkout sh-add-workhorse-skip-interval
make -C workhorse
gdk restart gitlab-workhorse
- Run:
gdk tail gitlab-workhorse
- Access your GDK, and periodically run:
curl -s http://localhost:8182/readiness | jq
You should eventually see skipped_due_to_recent_success set to true:
{
"checks": {
"puma_readiness": {
"control_duration_s": 0.001491458,
"control_server": true,
"control_server_last_scrape_time": "2025-10-10T04:56:44Z",
"healthy": true,
"readiness_duration_s": 0,
"readiness_endpoint": true,
"skip_interval_s": 30,
"skipped_due_to_recent_success": true
}
},
"health_thresholds": {
"max_consecutive_failures": 1,
"min_successful_probes": 1
},
"metrics": {
"consecutive_failures": 0,
"consecutive_successes": 4
},
"ready": true
}
- If you stop accessing your GDK, you should see skipped_due_to_recent_success go back to false once the skip interval passes without a successful request:
{
"checks": {
"puma_readiness": {
"control_duration_s": 0.006726667,
"control_server": true,
"control_server_last_scrape_time": "2025-10-10T04:57:14Z",
"healthy": true,
"readiness_duration_s": 0.035112667,
"readiness_endpoint": true,
"readiness_last_scrape_time": "2025-10-10T04:57:14Z",
"skipped_due_to_recent_success": false
}
},
"health_thresholds": {
"max_consecutive_failures": 1,
"min_successful_probes": 1
},
"metrics": {
"consecutive_failures": 0,
"consecutive_successes": 7
},
"ready": true
}
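For scripted validation, the curl | jq check can also be done programmatically. This is a minimal Go sketch that assumes the health check listener address from the workhorse/config.toml example above and decodes only fields that appear in the sample responses; readinessResponse, decodeReadiness and fetchReadiness are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"time"
)

// readinessResponse models the subset of the Workhorse /readiness JSON
// (as shown in the samples above) that this check needs.
type readinessResponse struct {
	Ready  bool `json:"ready"`
	Checks struct {
		PumaReadiness struct {
			Healthy                   bool    `json:"healthy"`
			SkipIntervalS             float64 `json:"skip_interval_s"` // absent when not skipped
			SkippedDueToRecentSuccess bool    `json:"skipped_due_to_recent_success"`
		} `json:"puma_readiness"`
	} `json:"checks"`
}

// decodeReadiness parses a /readiness response body.
func decodeReadiness(body []byte) (readinessResponse, error) {
	var r readinessResponse
	err := json.Unmarshal(body, &r)
	return r, err
}

// fetchReadiness performs one GET and decodes the body, regardless of the
// HTTP status code (a not-ready response may still carry JSON).
func fetchReadiness(client *http.Client, url string) (readinessResponse, error) {
	resp, err := client.Get(url)
	if err != nil {
		return readinessResponse{}, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return readinessResponse{}, err
	}
	return decodeReadiness(body)
}

func main() {
	// Address of the health check listener from the config.toml example.
	const url = "http://localhost:8182/readiness"
	client := &http.Client{Timeout: 2 * time.Second}

	r, err := fetchReadiness(client, url)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	p := r.Checks.PumaReadiness
	fmt.Printf("ready=%t healthy=%t skipped_due_to_recent_success=%t\n",
		r.Ready, p.Healthy, p.SkippedDueToRecentSuccess)
}
```

Saved as main.go, running it repeatedly (for example with watch -n 5 go run main.go) should show the flag switching between true and false as traffic to your GDK starts and stops.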
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.