Improve `zero-downtime` health-check and re-design `/-/health`
Problem
There was a raised concern on gitlab-com/gl-infra/production#1268 (comment 234803172) that the health-check implemented on separate endpoint as #30201 (comment 224452030).
We did implement additional endpoint on Unicorn/Puma
that published a health-check, but it had the following downsides:
- We were skipping check of Workhorse,
- We were not testing Unicorn/Puma HTTP endpoint that serves actual requests.
We for now circumvented that by using an old /-/health
endpoint and doing a dual check effectively:
-
/-/health
going throughnginx
,workhorse
tounicorn/puma
over a regular HTTP listener, -
web_exporter/readiness
that goes directly toPuma/Unicorn
master over separate endpoint.
The problem is that the above workaround cannot be really used with any other load-balancers, as the approach that we took with HAProxy
can be considered a workaround: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241#note_234317356.
Proposal
Go back to using /-/readiness
endpoint, and refactor it's meaning to align with what we did for web_exporter/readiness.
Make the:
-
/-/liveness
to return status OK, always: this already behaves like that, -
/-/readiness
remove probes to internal services (Redis/Gitaly/DB) as this is not aligned with definition of/readiness
, and only check the local state of the service: can it accept the traffic.
This will:
- align the
/-/readiness
with the definition ofweb_exporter/readiness
, - make us use a single endpoint for health-check that passes all layers (nginx/workhorse/unicorn/puma) and effectively achieve zero-downtime easily
Consideration
One of the problem of /-/readiness
is that this contributes to application statistics. Maybe we could exempt this controller from gathering metrics on Rails
and introduce separate metric for this controller only. That way we have APDEX calculated correctly.
I do consider this to be follow-up.
References
https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/8241