Add Workhorse health check listener
Background
In Cloud Native GitLab, we currently configure a webservice pod that consists of two containers: gitlab-workhorse and webservice (Puma).
Both containers have their own readiness and liveness checks:
- gitlab-workhorse has a script that tries to see if a GET / request can be sent to TCP port 8181. It does not wait for the response.
- webservice is configured to scrape /-/liveness and /-/readiness.
We've seen numerous problems with these separate checks:
- The main issue appears to be that the Puma readiness endpoint (/-/readiness) can take longer than 2 s if Puma is busy. With periodSeconds of 5 and failureThreshold of 2, a failure can be detected in 10 s, but a successful scrape will bring the pod back to the ready state within 5 seconds. This flapping is problematic with Cilium because in-flight TCP requests are reset after a pod is re-added.
- Ideally, we want a readiness probe that can definitively tell when Puma is up and is capable of handling requests in a reasonable amount of time.
- At the same time, we want to be able to detect an intentional shutdown or unintentional failure quickly.
- When a webservice pod is terminated, Workhorse currently receives the SIGTERM 10 s later than Puma to accommodate how the AWS Load Balancer operates. This is implemented with a preStop hook.
- This is why the frequency of the Puma readiness checks was increased: in order to detect a failure, Puma gets the SIGTERM signal first, and Puma initiates a shutdown before Workhorse does. The longer it takes to detect a failure, the more likely it is that Workhorse sends traffic to a downed Puma and receives 502 errors.
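The timing asymmetry behind the flapping can be worked out with a rough calculation. The sketch below is only an approximation: detectionSeconds is a hypothetical helper, it ignores probe timeouts and scheduling jitter, and it assumes a success threshold of 1, which the text implies but does not state.

```go
package main

import "fmt"

// detectionSeconds approximates how long a probe needs to change pod state:
// `threshold` consecutive results, one every `periodSeconds`.
// It ignores probe timeouts and kubelet scheduling jitter.
func detectionSeconds(periodSeconds, threshold int) int {
	return periodSeconds * threshold
}

func main() {
	// periodSeconds = 5, failureThreshold = 2 (values from the text above).
	fmt.Println("marked not ready after (s):", detectionSeconds(5, 2))
	// A success threshold of 1 is an assumption.
	fmt.Println("marked ready again after (s):", detectionSeconds(5, 1))
}
```

A pod therefore takes about twice as long to leave the ready set as it does to rejoin it, which is what allows a busy Puma to flap.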
We can improve things significantly by:
- Adding a readiness endpoint for Workhorse that is responsible for checking the readiness of the downstream Puma server.
- Workhorse can make its own async requests to Puma's /-/readiness and the control app server on Puma to determine how many threads/workers are running. In addition, Workhorse could consider the number of queued requests, 500 errors, etc.
- As a result, the readiness check should always be fast, and we should rarely ever hit a 2-second timeout.
- We can then simplify the Helm Chart by removing the readiness check on Puma entirely. The webservice pod readiness will only be handled by Workhorse.
- To accommodate the AWS Load Balancer behavior, we should have a way to mark the readiness endpoint as not ready while still accepting new traffic for a certain time. Then we can eliminate the preStop hook.
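One way to keep the readiness check fast is to separate probing from answering: a background loop checks Puma and stores the result, and the HTTP handler only reads the stored value. The Go sketch below illustrates that idea under stated assumptions. The names cachedReadiness, poll, and status are hypothetical and are not taken from the Workhorse code, and the check function is a stand-in for the real requests to Puma.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"sync/atomic"
	"time"
)

// cachedReadiness holds the most recent probe result so that the HTTP
// handler never waits on Puma. Hypothetical type, for illustration only.
type cachedReadiness struct {
	ready atomic.Bool
}

// poll runs check immediately and then once per interval until ctx is
// cancelled, storing each result.
func (c *cachedReadiness) poll(ctx context.Context, interval time.Duration, check func(context.Context) bool) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		c.ready.Store(check(ctx))
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// status maps the cached state to an HTTP status code.
func (c *cachedReadiness) status() int {
	if c.ready.Load() {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// ServeHTTP answers from the cached value only, so it returns immediately
// even when Puma is slow.
func (c *cachedReadiness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(c.status())
}

func main() {
	c := &cachedReadiness{}
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	// Stand-in check that always succeeds; a real one would query Puma.
	go c.poll(ctx, 10*time.Millisecond, func(context.Context) bool { return true })
	time.Sleep(20 * time.Millisecond)
	fmt.Println("status:", c.status())
}
```

With this split, the response time of the readiness endpoint no longer depends on how busy Puma is; only the freshness of the cached value does.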
What does this MR do and why?
This commit adds support for a health check listener that handles readiness probes, designed specifically for Kubernetes. It can optionally talk to the Puma control app and determine whether the app has booted.
To enable this, add the following to the Workhorse config.toml:
[health_check_listener]
network = "tcp"
# Address to bind the health check server to
addr = "localhost:8082"
readiness_probes_url = "https://gdk.test:3443/-/readiness"
When configured, the readiness probe will:
- Attempt to contact the Puma control server (if specified with puma_control_url). If that shows no workers have booted, then the readiness will be set to false.
- If the control server is up, then the health check will attempt to probe the upstream /-/readiness endpoint.
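The two steps above can be sketched as a small decision function. This is an illustration, not the code in this MR: pumaStats, workersBooted, and evaluate are hypothetical names, the booted_workers field assumes the JSON that Puma's control app reports in cluster mode, and treating an unparsable control response as not ready is also an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pumaStats models only the field this sketch needs. The field name assumes
// Puma's cluster-mode stats payload; it is not taken from the MR itself.
type pumaStats struct {
	BootedWorkers int `json:"booted_workers"`
}

// workersBooted reports whether the payload shows at least one booted worker.
func workersBooted(statsJSON []byte) (bool, error) {
	var s pumaStats
	if err := json.Unmarshal(statsJSON, &s); err != nil {
		return false, err
	}
	return s.BootedWorkers > 0, nil
}

// evaluate applies the two steps in order. controlStats is nil when no
// puma_control_url is configured, in which case step 1 is skipped.
// readinessProbe stands in for the HTTP request to /-/readiness.
func evaluate(controlStats []byte, readinessProbe func() bool) bool {
	if controlStats != nil {
		booted, err := workersBooted(controlStats)
		if err != nil || !booted {
			return false // assumption: an unparsable response counts as not ready
		}
	}
	return readinessProbe()
}

func main() {
	healthy := func() bool { return true }
	fmt.Println(evaluate([]byte(`{"workers":2,"booted_workers":0}`), healthy))
	fmt.Println(evaluate([]byte(`{"workers":2,"booted_workers":2}`), healthy))
	fmt.Println(evaluate(nil, healthy))
}
```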
The health check listener has max_consecutive_failures, min_successful_probes, and check_interval settings that can be used to tune the thresholds at which the readiness endpoint is marked as ready or not ready. Setting max_consecutive_failures and min_successful_probes to 1 flags the endpoint as ready or not ready immediately.
References
Relates to:
How to set up and validate locally
- Enable listening on a TCP port in Puma. In gdk.yml, add gitlab.address:
gitlab:
  address: localhost:9080
- Run gdk reconfigure.
- To activate the Puma control app, edit gitlab/config/puma.rb and add this line:
activate_control_app 'tcp://127.0.0.1:9293', { no_token: true }
- Run gdk restart rails-web.
- Edit gitlab/workhorse/config.toml:
[health_check_listener]
network = "tcp"
addr = "localhost:8082"
puma_control_url = "http://localhost:9293"
- In this branch, run make -C workhorse.
- Run gdk restart gitlab-workhorse.
- You can observe the state of the readiness endpoint via curl -v http://localhost:8082/readiness. If you run gdk stop rails-web, you'll see the HTTP status code go from 200 to 503. Running gdk start rails-web should bring it back. You can also comment out activate_control_app and run gdk restart rails-web to see the effect.
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.