Move Ruby health checks to a different listener
This is either a generalisation or an extension to #30037 (closed), proposed by @ayufan.
Proposal
Move all health checks to a different listener in Rails, used only for health checks and Prometheus endpoint scrapes.
Reasoning
Currently, health checks are subject to the same very high timeouts as every other endpoint proxied from Workhorse to Rails. For most purposes, they are treated almost exactly like any other endpoint.
One downside to this approach is that a check of /-/health can take up to 30s to time out. During a unicorn/puma HUP or a restart, this means that health checks will be delayed for up to 30s, allowing traffic to pile up in a queue for servicing on the instance.
This traffic queues up, and once the service has started, it must immediately handle a "glut" of requests. This leads to CPU spikes and latency issues on top of the queuing latency that has already been incurred. It is made worse by the fact that the process is "new": caches will be cold, some lazy initialization will need to take place, etc.
This can be seen in the graph below, where a unicorn HUP occurred at exactly 09:55.
The chart shows CPU (blue), RPS (yellow), and Apdex (green) over the course of a unicorn HUP on a single node. CPU spikes as the code warms up while the process simultaneously works through a backlog of already-delayed requests; this produces the CPU spike and the Apdex drop.
In this Slack thread (https://gitlab.com/archives/CB3LSMEJV/p1563385400477900), different approaches to resolving this were discussed.
Alternative proposal 1 - shorter health check timeouts
One approach discussed on the call would be to use a shorter timeout on health checks.
The problem with this approach is that it will lead to site-wide outages if we reach unicorn/puma worker saturation.
Since this situation has occurred several times in the past few weeks, it is definitely not something we want to deploy: it would leave us less resilient, not more.
Why would it fail? When worker saturation occurs, requests might queue for several seconds. This is not good, but it is better than a 502. However, since /-/health checks wait in the same queue as all other requests, reducing their timeout means they will start failing.
This puts more pressure on the remaining workers, causing a domino effect of failing nodes.
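The domino effect can be sketched with a toy capacity model. All numbers here are hypothetical (worker counts, service times, and the `surviving_nodes` helper are illustrative, not measurements from gitlab.com): once a fleet is slightly over capacity, each node's queue wait exceeds the short health-check timeout, the node is marked down, its traffic shifts to the survivors, and the cascade continues.

```ruby
# Toy model of the domino effect (all numbers hypothetical).
WORKERS_PER_NODE = 16
SERVICE_TIME_S   = 0.5   # assumed mean time to serve one request
HEALTH_TIMEOUT_S = 1.0   # the proposed short health-check timeout

# Crude estimate of how long a request waits in a node's queue.
def queue_wait(rps_per_node)
  capacity = WORKERS_PER_NODE / SERVICE_TIME_S   # ~32 req/s per node
  excess   = rps_per_node - capacity
  excess > 0 ? excess * SERVICE_TIME_S : 0.0     # seconds of queueing
end

# While a node's queue wait exceeds the health-check timeout, haproxy
# marks it down and its share of traffic moves to the remaining nodes.
def surviving_nodes(total_rps, nodes)
  while nodes > 1 && queue_wait(total_rps / nodes) > HEALTH_TIMEOUT_S
    nodes -= 1   # health check timed out: another node is dropped
  end
  nodes
end

puts surviving_nodes(300.0, 10)  # under capacity: all 10 nodes stay up
puts surviving_nodes(400.0, 10)  # slightly over capacity: cascade to 1
```

The point of the model: with health checks sharing the main request queue, a short timeout converts moderate saturation into a fleet-wide cascade.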
Sending health check requests to the VIP entrance
To fix this, we could use a shorter timeout, but ensure that health check requests (and Prometheus scrape requests, per https://gitlab.com/gitlab-org/gitlab-ce/issues/64423) "skip the queue".
Steps required:
- Set up a listener in the Ruby parent process that handles only health checks and Prometheus scrapes
- Route the /-/health, /-/liveness, /-/metrics, and /-/readiness endpoints to the new Ruby listener; in Workhorse, use a very short (~1 second?) timeout on /-/health requests
- Add logic to the Ruby health check to ensure that unicorn workers are ready before it returns "healthy"
- Configure haproxy to use a much shorter (~1.5 seconds?) timeout on health check requests to Workhorse
- During a restart, the health check will immediately return unhealthy, without delay, until the workers have started.
- Once the workers start, the health check will return healthy and haproxy will start sending requests again.
- During saturation, the health check will still return healthy, even with the short timeout, so requests will continue to queue.
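A minimal sketch of the proposed separate listener, assuming it runs inside the unicorn/puma master process (the `HealthListener` class, port, and `workers_ready!` hook are hypothetical, not GitLab's actual implementation). Because it serves from its own socket, it answers instantly instead of waiting behind the main request queue: 503 until the workers report ready, 200 afterwards.

```ruby
require "socket"

# Hypothetical sketch of a health-check-only listener for the master
# process. /-/metrics is omitted for brevity.
class HealthListener
  HEALTH_PATHS = %w[/-/health /-/liveness /-/readiness].freeze

  def initialize(port)
    @server = TCPServer.new("127.0.0.1", port)
    @ready  = false
  end

  # Called by the master once all workers have booted.
  def workers_ready!
    @ready = true
  end

  # Serve a single connection; in practice this would loop in a
  # background thread of the master process.
  def handle_one
    client  = @server.accept
    request = client.gets.to_s                 # e.g. "GET /-/health HTTP/1.1"
    path    = request.split(" ")[1]
    # Drain the remaining request headers before replying.
    loop do
      line = client.gets
      break if line.nil? || line.strip.empty?
    end
    status, body =
      if HEALTH_PATHS.include?(path)
        @ready ? [200, "ok"] : [503, "starting"]
      else
        [404, "not found"]
      end
    client.write("HTTP/1.1 #{status}\r\nContent-Length: #{body.bytesize}\r\n" \
                 "Connection: close\r\n\r\n#{body}")
  ensure
    client&.close
  end
end
```

With haproxy's check timeout set to ~1.5s, this listener fails fast during boot (immediate 503 rather than a 30s hang) and keeps answering 200 during worker saturation, since it never shares the workers' queue.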