Implement Gitaly/Praefect health service
Currently, Praefect only has a readiness RPC, which also appears to function as a "liveness" health endpoint. This readiness RPC executes a series of statically defined checks, and if any of them fail, the RPC response indicates that Praefect is not ready. A failing readiness check can therefore have downstream effects on the deployment and its availability. The same checks can also be executed through the Praefect CLI via the check subcommand.
In #4598 (comment 1167683623), it was discussed whether the clock sync check is really a measure of "readiness". The two terms can be distinguished as follows:
- "health"/liveness: Is the application healthy, or do we need to restart it?
- "readiness": Is the application ready to accept traffic? This doesn't mean that the traffic is going to be successful.
I propose we create a health service within Gitaly and Praefect that defines separate liveness and readiness endpoints. This would give clearer semantics around what it means for the application to be ready/alive, and it would also allow individual Gitaly nodes to provide insight into their health independent of Praefect. Praefect health endpoints could perform their own checks and also query the health of its nodes through the Gitaly health service.
The readiness RPC currently runs the following checks:
- migration check
- Gitaly connectivity check
- Postgres read/write check
- unavailable repositories check
- clock sync check
These checks seem to have been repurposed from some Praefect setup/preflight checks, but have since been included as part of the readiness RPC. The existing checks could be separated into categories such as preflight, liveness, and readiness.
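As a starting point for that discussion, the categorization could be expressed as data, so that each endpoint simply selects the checks in its category. The sketch below is hypothetical: the `Category` type and `namesIn` helper do not exist in Praefect, and the category assigned to each check is only an example, not a recommendation.

```go
package main

import "fmt"

// Category says when a check is meant to run.
type Category int

const (
	Preflight Category = iota // run once, e.g. from the CLI before start-up
	Liveness                  // does the process need a restart?
	Readiness                 // can the process accept traffic?
)

// String returns the lower-case name of the category.
func (c Category) String() string {
	return [...]string{"preflight", "liveness", "readiness"}[c]
}

// categorized maps the existing checks to categories. The assignment is
// an assumption made for illustration and is open to debate.
var categorized = []struct {
	Name     string
	Category Category
}{
	{"migration", Preflight},
	{"clock sync", Preflight},
	{"Postgres read/write", Readiness},
	{"Gitaly connectivity", Readiness},
	{"unavailable repositories", Readiness},
}

// namesIn returns the names of the checks in the given category,
// in declaration order.
func namesIn(cat Category) []string {
	var names []string
	for _, c := range categorized {
		if c.Category == cat {
			names = append(names, c.Name)
		}
	}
	return names
}

func main() {
	for _, cat := range []Category{Preflight, Liveness, Readiness} {
		fmt.Printf("%s: %q\n", cat, namesIn(cat))
	}
}
```

With this particular assignment no existing check lands in the liveness category, which suggests that a liveness endpoint might need new, cheaper checks rather than a subset of the current ones.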
Other ideas that might be good to introduce:
- Ability to enable/disable various health checks.
- Allowing checks to run asynchronously at configurable intervals, independent of RPC invocation, with the result cached. This could reduce the load on services that the checks depend on (such as the NTP service used by the clock sync check).
/cc @steveazz