Add Workhorse health check listener

Background

In Cloud Native GitLab, currently we have configured a webservice pod that consists of two containers: gitlab-workhorse and webservice (Puma).

Both containers have their own readiness and liveness checks:

  1. gitlab-workhorse has a script that tries to see if a GET / request can be sent to TCP port 8181. It does not wait for the response.
  2. webservice is configured to scrape /-/liveness and /-/readiness.

We've seen numerous problems with these separate checks:

  1. The main issue appears to be that the Puma readiness endpoint (/-/readiness) can take longer than 2 s if Puma is busy. With periodSeconds of 5 and failureThreshold of 2, a failure is detected within 10 s, but a single successful scrape brings the pod back to the ready state within 5 s. This flapping is problematic with Cilium because in-flight TCP requests are reset after a pod is re-added.
  2. Ideally, we want a readiness probe that can definitively tell when Puma is up and is capable of handling requests in a reasonable amount of time.
  3. At the same time, we want to be able to detect an intentional shutdown or unintentional failure quickly.
  4. When a webservice pod is terminated, Workhorse currently receives the SIGTERM 10 s later than Puma to accommodate how the AWS Load Balancer operates. This is implemented with a preStop hook.
  5. This is why the frequency of the Puma readiness checks was increased: Puma receives the SIGTERM signal first and initiates a shutdown before Workhorse does, so the failure needs to be detected quickly. The longer detection takes, the more likely it is that Workhorse sends traffic to a downed Puma and returns 502 errors.
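For reference, the flapping arithmetic above corresponds to probe settings along these lines. This is an illustrative sketch only; the port and exact values in the actual chart may differ:

```yaml
readinessProbe:
  httpGet:
    path: /-/readiness
    port: 8080             # assumed Puma port for illustration
  periodSeconds: 5         # scrape every 5 s
  timeoutSeconds: 2        # a busy Puma can exceed this
  failureThreshold: 2      # marked not ready after ~10 s of failures
  successThreshold: 1      # marked ready again after a single 5 s scrape
```

With these values, the pod needs two consecutive slow responses to leave the ready state but only one fast response to rejoin it, which produces the flapping described above.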

We can improve things significantly by:

  1. Adding a readiness endpoint for Workhorse that is responsible for checking the readiness of the downstream Puma server.
  2. Workhorse can make its own async requests to Puma's /-/readiness and the control app server on Puma to determine how many threads/workers are running. In addition, Workhorse could consider the number of queued requests, 500 errors, etc.
  3. As a result, the readiness check should always be fast, and we should rarely ever hit a 2-second timeout.
  4. We can then simplify the Helm Chart by removing the readiness check on Puma entirely. The webservice pod readiness will only be handled by Workhorse.
  5. To accommodate the AWS Load Balancer behavior, we should have a way to mark the readiness endpoint as not ready while still accepting new traffic for a certain time. Then we can eliminate the preStop hook.

What does this MR do and why?

This commit adds support for a health check listener that handles readiness probes, designed specifically for Kubernetes. It can optionally talk to the Puma control app and determine whether the app has booted.

To enable this, add this to the Workhorse config.toml:

[health_check_listener]
  network = "tcp"
  # Address to bind the health check server to
  addr = "localhost:8082"
  readiness_probes_url = "https://gdk.test:3443/-/readiness"

When configured, the readiness probe will:

  1. Attempt to contact the Puma control server (if specified with puma_control_url). If that shows no workers have booted, then the readiness will be set to false.

  2. If the control server is up, then the health check will attempt to probe the upstream /-/readiness endpoint.

The health check listener supports max_consecutive_failures, min_successful_probes, and check_interval settings that tweak the thresholds for when the readiness endpoint is marked ready or not ready. Setting max_consecutive_failures and min_successful_probes to 1 flags the endpoint ready or not ready immediately.
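Putting the knobs together, a fuller config.toml block might look like this. The key names come from this MR, but the threshold values and the duration syntax for check_interval are illustrative assumptions:

```toml
[health_check_listener]
  network = "tcp"
  addr = "localhost:8082"
  readiness_probes_url = "https://gdk.test:3443/-/readiness"
  puma_control_url = "http://localhost:9293"
  max_consecutive_failures = 3   # mark not ready after 3 straight failures
  min_successful_probes = 2      # require 2 straight successes to mark ready
  check_interval = "5s"          # assumed duration syntax for the probe interval
```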

References

Relates to:

How to set up and validate locally

  1. Enable listening on a TCP port in Puma. In gdk.yml, add gitlab.address:

gitlab:
    address: localhost:9080

  2. Run gdk reconfigure.

  3. To activate the Puma control app, edit gitlab/config/puma.rb and add this line:

activate_control_app 'tcp://127.0.0.1:9293', { no_token: true }

  4. Run gdk restart rails-web.

  5. Edit gitlab/workhorse/config.toml:

[health_check_listener]
  network = "tcp"
  addr = "localhost:8082"
  puma_control_url = "http://localhost:9293"

  6. In this branch, run make -C workhorse.

  7. Run gdk restart gitlab-workhorse.

  8. Observe the state of the readiness endpoint via curl -v http://localhost:8082/readiness. If you run gdk stop rails-web, the HTTP status code goes from 200 to 503; gdk start rails-web should bring it back. You can also comment out activate_control_app and gdk restart rails-web to see the effect.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Stan Hu
