Export Prometheus metrics for all HTTP requests made by runner manager
Overview
Instrument all HTTP requests made by the runner manager.
Background
In gitlab-com/gl-infra/production#19438 (closed), we noticed Sidekiq jobs that appeared to be finished in the logs but never actually finished, because the PUT requests to update the job took a long time or never completed. In the GitLab server logs, we saw many EOF failures and Rack Attack rate limiting errors, but from the runner logs it was harder to tell whether there was a problem on the runner side.
One thing we discovered was that the retry mechanism wasn't even working on the runner: #38651 (closed). That will be fixed by !5409 (merged).
However, since we run a lot of shared and private runners, we should instrument all HTTP requests made by the runner manager. For example, we would like to know:
- The rate of GET, PUT, PATCH, and POST requests to the GitLab API.
- The number of retries for each endpoint.
- The status codes of each request.
- The duration of each request.
With these metrics, we could potentially define an Apdex for HTTP requests on the runner side.