Follow-up from "Refactor `readiness` and `liveness` probes and add them to `sidekiq|web_exporter`"
The following discussion from !17960 (merged) should be addressed:
- @andrewn started a discussion: (+3 comments)

Thanks for this @ayufan. As discussed on Slack, I have misgivings about using any downstream dependencies as part of our health check (as mentioned in !17955 (comment 224802453)); https://blog.colinbreck.com/kubernetes-liveness-and-readiness-probes-how-to-avoid-shooting-yourself-in-the-foot/ makes the same point:
> But consider what happens if there is a small, temporary increase in latency to one dependent service—maybe due to network congestion, a garbage-collection pause, or a temporary increase in load for the dependent service. If latency to the dependency increases to even slightly above one second, the readiness probe will fail and Kubernetes will no longer route traffic to the pod. Since all of the pods share the same dependency, it is very likely that all pods backing the service will fail the readiness probe at the same time. This will result in all pods being removed from the service routing. With no pods backing the service, Kubernetes will return HTTP 404, the default backend, for all requests to the service. We have created a single point of failure that renders the service completely unavailable, despite our best efforts to improve availability.[2] In this scenario, we would deliver a much better end-user experience by letting the client requests succeed, albeit with slightly increased latency, rather than making the entire service unavailable for seconds or minutes at a time.
Having said that, this moves us in the right general direction.
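To make the alternative concrete, below is a minimal sketch of a "shallow" readiness endpoint that only confirms the process itself can serve requests, deliberately touching neither Redis nor the database. This is illustrative only, not GitLab's actual implementation; the bare Rack app and port 9393 are assumptions made up for the sketch.

```ruby
# Illustrative sketch only -- not GitLab's actual probe code.
# A "shallow" readiness check: it returns 200 as soon as the process
# itself can accept requests, without calling Redis, the database, or
# any other shared dependency. A latency spike in a dependency
# therefore cannot knock every pod out of the service at once.
require 'rack'
require 'webrick'
require 'json'

app = proc do |env|
  if env['PATH_INFO'] == '/-/readiness'
    # The status code is the machine-readable contract; the JSON body
    # exists only as a hint for humans diagnosing a problem.
    [200, { 'Content-Type' => 'application/json' }, [{ status: 'ok' }.to_json]]
  else
    [404, { 'Content-Type' => 'text/plain' }, ['not found']]
  end
end

Rack::Handler::WEBrick.run(app, Port: 9393)
```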
Also, we don't use `/-/readiness` on GitLab.com, so we can continue to experiment on this endpoint for now. So LGTM.

I don't think it is documented anywhere what these endpoints represent.
WRT the content of the response, I would say that we should be explicit about not defining it (and document that). Automated clients should rely only on the HTTP status code to determine the health of the service; the content should be there only to provide hints to humans diagnosing a problem.
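For example, a client written against that contract would branch only on the status code and ignore the body entirely. The sketch below is hypothetical, and the host/port values are placeholders rather than real deployment settings.

```ruby
# Illustrative client-side probe -- host and port are placeholders.
require 'net/http'

def ready?(host: 'localhost', port: 9393)
  response = Net::HTTP.get_response(URI("http://#{host}:#{port}/-/readiness"))
  # Branch only on the status code; the body is undefined for machines
  # and may change at any time.
  response.is_a?(Net::HTTPSuccess)
rescue SystemCallError, Net::OpenTimeout, Net::ReadTimeout
  false
end

puts(ready? ? 'ready' : 'not ready')
```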
As discussed in the call earlier, let's:

- Remove Redis and Database checks from our readiness checks for now: !18665 (merged)
- Move all LB checks over to the web exporter, using this readiness check: closed by #30201 (closed)
- Document and formalise this as the standard health check for GitLab: #34235 (closed)