Add /health endpoint to track application readiness
Description
We have already seen a case in production where more than half the fleet lost its connection to the database. This was non-trivial to troubleshoot while the service was degraded. We have pending issues open in infrastructure to prevent this from happening again, or at least to simplify troubleshooting this kind of situation.
Proposal
Add a /health endpoint to the application and monitor its output with Prometheus to get a clear, immediate view of how the application is performing.
This endpoint should not be affected by the multiprocessing limitations of the Prometheus Ruby client, because it would report the current state at the moment it is requested.
In this endpoint we should perform a set of checks to report the status of the application, for example:
- a DB ping (including latency)
- a Redis ping
- a filesystem access probe on every possible shard
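As a rough illustration of how such checks could be run and timed, here is a minimal sketch in plain Ruby. Everything in it is hypothetical: the `Check` struct, `run_check`, `health_report`, and the check names are made up for this example, and the probes are stand-ins rather than real database, Redis, or shard calls.

```ruby
require 'json'

# Hypothetical structure: a named check whose probe raises on failure.
Check = Struct.new(:name, :probe)

# Milliseconds elapsed since a monotonic start time.
def elapsed_ms(started)
  ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round(2)
end

# Runs one check, recording its latency and any error message.
def run_check(check)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  check.probe.call
  { name: check.name, ok: true, latency_ms: elapsed_ms(started) }
rescue StandardError => e
  { name: check.name, ok: false, latency_ms: elapsed_ms(started), error: e.message }
end

# Runs every check; the worker is healthy only if all of them pass.
def health_report(checks)
  results = checks.map { |check| run_check(check) }
  { healthy: results.all? { |r| r[:ok] }, checks: results }
end

# Stand-in probes (assumptions): a real implementation would issue a
# trivial query to the database, a PING to Redis, and a stat or small
# read on each storage shard.
CHECKS = [
  Check.new('db', -> { true }),
  Check.new('redis', -> { true }),
  Check.new('fs_shard_example', -> { File.stat('.') })
].freeze

puts JSON.pretty_generate(health_report(CHECKS))
```

Timing each probe individually gives the latency figure mentioned for the DB ping and makes a slow dependency visible even when it has not failed outright.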
The reasoning behind this endpoint is that it would simplify troubleshooting the status of each worker, removing the need to check logs to understand how each worker is doing.
This would also make it possible to dynamically take traffic away from a given worker that is not ready to take load, by reporting a state such as temporarily unavailable (503). The body of the reply could also explain why the service is not available.
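One possible shape for that reply, sketched in plain Ruby: a function that turns a list of check results into a Rack-style response triple, returning 200 when everything passes and 503 with the failure reasons otherwise. The `health_response` name, the result hash keys, and the JSON body layout are assumptions made for this example, not an agreed format.

```ruby
require 'json'

# Hypothetical helper: builds a Rack-style [status, headers, body] triple
# from check results shaped like { name:, ok:, error: }.
def health_response(results)
  failed = results.reject { |r| r[:ok] }
  status = failed.empty? ? 200 : 503

  body = {
    status: failed.empty? ? 'ok' : 'unavailable',
    # Explain why the worker is not ready, one entry per failed check.
    failures: failed.map { |r| { check: r[:name], reason: r[:error] } }
  }

  [status, { 'content-type' => 'application/json' }, [JSON.generate(body)]]
end

# Example: the database check failed, so the worker reports 503.
status, _headers, body = health_response(
  [
    { name: 'db', ok: false, error: 'connection refused' },
    { name: 'redis', ok: true }
  ]
)
puts status
puts body.first
```

A load balancer or orchestrator only needs the status code to stop routing traffic to the worker, while the body is there for whoever is troubleshooting.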
I think this would be extremely easy to implement and would give the application a really clean way to report what state it is in, removing the need to reverse engineer that state whenever we are experiencing an outage.
Bonus points
With this we could start opening the door to Prometheus monitoring of GitLab, and start preparing the environment for a future deployment in a Kubernetes cluster with autoscaling.