Improve `zero-downtime` health-check and re-design `/-/health`
Problem
There was a raised concern on gitlab-com/gl-infra/production#1268 (comment 234803172) that the health-check implemented on separate endpoint as #30201 (comment 224452030).
We did implement additional endpoint on Unicorn/Puma that published a health-check, but it had the following downsides:
- We were skipping check of Workhorse,
- We were not testing Unicorn/Puma HTTP endpoint that serves actual requests.
We for now circumvented that by using an old /-/health endpoint and doing a dual check effectively:
-
/-/healthgoing throughnginx,workhorsetounicorn/pumaover a regular HTTP listener, -
web_exporter/readinessthat goes directly toPuma/Unicornmaster over separate endpoint.
The problem is that the above workaround cannot be really used with any other load-balancers, as the approach that we took with HAProxy can be considered a workaround: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241#note_234317356.
Proposal
Go back to using /-/readiness endpoint, and refactor it's meaning to align with what we did for web_exporter/readiness.
Make the:
-
/-/livenessto return status OK, always: this already behaves like that, -
/-/readinessremove probes to internal services (Redis/Gitaly/DB) as this is not aligned with definition of/readiness, and only check the local state of the service: can it accept the traffic.
This will:
- align the
/-/readinesswith the definition ofweb_exporter/readiness, - make us use a single endpoint for health-check that passes all layers (nginx/workhorse/unicorn/puma) and effectively achieve zero-downtime easily
Consideration
One of the problem of /-/readiness is that this contributes to application statistics. Maybe we could exempt this controller from gathering metrics on Rails and introduce separate metric for this controller only. That way we have APDEX calculated correctly.
I do consider this to be follow-up.
References
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241