Improve `zero-downtime` health-check and re-design `/-/health`

Problem

There was a raised concern on gitlab-com/gl-infra/production#1268 (comment 234803172) that the health-check implemented on separate endpoint as #30201 (comment 224452030).

We did implement additional endpoint on Unicorn/Puma that published a health-check, but it had the following downsides:

  1. We were skipping check of Workhorse,
  2. We were not testing Unicorn/Puma HTTP endpoint that serves actual requests.

We for now circumvented that by using an old /-/health endpoint and doing a dual check effectively:

  1. /-/health going through nginx, workhorse to unicorn/puma over a regular HTTP listener,
  2. web_exporter/readiness that goes directly to Puma/Unicorn master over separate endpoint.

The problem is that the above workaround cannot be really used with any other load-balancers, as the approach that we took with HAProxy can be considered a workaround: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241#note_234317356.

Proposal

Go back to using /-/readiness endpoint, and refactor it's meaning to align with what we did for web_exporter/readiness.

Make the:

  1. /-/liveness to return status OK, always: this already behaves like that,
  2. /-/readiness remove probes to internal services (Redis/Gitaly/DB) as this is not aligned with definition of /readiness, and only check the local state of the service: can it accept the traffic.

This will:

  1. align the /-/readiness with the definition of web_exporter/readiness,
  2. make us use a single endpoint for health-check that passes all layers (nginx/workhorse/unicorn/puma) and effectively achieve zero-downtime easily

Consideration

One of the problem of /-/readiness is that this contributes to application statistics. Maybe we could exempt this controller from gathering metrics on Rails and introduce separate metric for this controller only. That way we have APDEX calculated correctly.

I do consider this to be follow-up.

References

omnibus-gitlab#4739 (closed)

!17960 (merged)

https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241

Edited by Kamil Trzciński