Improve `zero-downtime` health-check and re-design `/-/health`

Problem

A concern was raised in gitlab-com/gl-infra/production#1268 (comment 234803172) about the health-check being implemented on a separate endpoint, as done in gitlab#30201 (comment 224452030).

We implemented an additional endpoint on Unicorn/Puma that publishes a health-check, but it has the following downsides:

  1. It skips the check of Workhorse.
  2. It does not test the Unicorn/Puma HTTP endpoint that serves actual requests.

For now, we circumvent this by also using the old /-/health endpoint, effectively doing a dual check:

  1. /-/health, which goes through nginx and workhorse to unicorn/puma over the regular HTTP listener,
  2. web_exporter/readiness, which goes directly to the Puma/Unicorn master over a separate endpoint.
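The dual check above can be sketched as a small probe that reports healthy only when both paths answer. This is a minimal illustration, not the HAProxy configuration actually in use: the hosts, ports, and function names are hypothetical, and HTTP 200 is assumed to mean "healthy".

```python
import urllib.error
import urllib.request
from typing import Callable

# Hypothetical probe URLs; hosts and ports are placeholders for illustration.
HEALTH_VIA_PROXY = "http://gitlab.example.com/-/health"           # nginx -> workhorse -> unicorn/puma
READINESS_DIRECT = "http://app-node.example.com:8083/readiness"  # web_exporter on the Puma/Unicorn master


def http_status(url: str, timeout: float = 2.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails entirely."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (for example 503).
        return err.code
    except OSError:
        # Connection refused, DNS failure, timeout, and similar errors.
        return 0


def dual_check(fetch_status: Callable[[str], int] = http_status) -> bool:
    """Healthy only if both the proxied and the direct probe return 200."""
    return all(
        fetch_status(url) == 200
        for url in (HEALTH_VIA_PROXY, READINESS_DIRECT)
    )
```

The node counts as healthy only when both probes succeed, which is why a load balancer that supports a single check URL per backend cannot express this directly.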

The problem is that this dual check cannot really be used with other load balancers, and even the approach that we took with HAProxy is itself a workaround: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241#note_234317356.

Proposal

Go back to using the /-/health endpoint, and refactor its meaning to align with what we did for web_exporter/readiness.

Change the endpoints as follows:

  1. /-/health/liveness: always return status OK. It already behaves this way.
  2. /-/health/readiness: remove the probes of internal services (Redis/Gitaly/DB), as they are not aligned with the definition of /readiness, and check only the local state of the service: can it accept traffic?
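The proposed semantics of the two endpoints can be sketched as follows. This is an illustration under stated assumptions, not the Rails implementation: `LocalState`, its `accepting_traffic` flag, and the 503 response for "not ready" are hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class LocalState:
    """Hypothetical local state of one web process."""
    # Assumed flag: True while this process can accept traffic.
    accepting_traffic: bool = True


def liveness(state: LocalState) -> Tuple[int, str]:
    """/-/health/liveness: always OK as long as the process can answer at all."""
    return 200, "OK"


def readiness(state: LocalState) -> Tuple[int, str]:
    """/-/health/readiness: depends only on local state.

    Deliberately does not probe Redis, Gitaly, or the database.
    """
    if state.accepting_traffic:
        return 200, "OK"
    return 503, "Service Unavailable"
```

With this split, an outage of an internal service does not remove every node from the load balancer at once; a node leaves rotation only when it reports that it cannot accept traffic itself.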

This will:

  1. align /-/health/readiness with the definition of web_exporter/readiness,
  2. let us use a single health-check endpoint that passes through all layers (nginx/workhorse/unicorn/puma), which makes zero-downtime easy to achieve.

Related to: #4739 (closed)