Prometheus scrape endpoint is performing health checks, making Gitaly calls, timing out and raising GRPC errors
Related to GitLab.com production incident: https://gitlab.com/gitlab-com/infrastructure/issues/4556
During this incident, a single Gitaly server was put under extreme load. The other 19 Gitaly servers were healthy.
One strange side-effect of this is that Prometheus marked much of the web, api and sidekiq fleets as down during the incident. Prometheus marks a job as down (`up{job="..."} = 0`) when it is unable to contact the scrape endpoint, when the endpoint returns a non-successful HTTP status code, or when the HTTP request takes longer than the configured `scrape_timeout` value (set to 15 seconds for the `gitlab-unicorn` job).
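As a concrete illustration of that last failure mode, here is a minimal Ruby sketch of what a scrape effectively does from the client side (the host, port, and `/-/metrics` path are illustrative assumptions, not taken from the incident):

```ruby
require "net/http"

# Simulate a Prometheus scrape with a 15-second timeout against the
# gitlab-unicorn metrics endpoint (host and port are illustrative).
uri = URI("http://localhost:8080/-/metrics")
http = Net::HTTP.new(uri.host, uri.port)
http.read_timeout = 15 # mirrors the scrape_timeout for this job

begin
  response = http.get(uri.path)
  puts "scrape succeeded: HTTP #{response.code}"
rescue Net::ReadTimeout, Errno::ECONNREFUSED
  # In either case Prometheus records up{job="gitlab-unicorn"} = 0.
  puts "scrape failed: job marked down"
end
```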
Prometheus scrape endpoints should be cheap to invoke and should not perform health checks during the call. They should return quickly and successfully with metric data, regardless of the health of backend services the application depends on. If this is not the case, it becomes impossible to tell the state of a service (in this case `gitlab-unicorn`) when any single one of its many backend dependencies is down.
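As a sketch of the desired shape, using the `prometheus-client` gem's text formatter (the Rack wrapper is illustrative): the scrape handler only serializes metrics that were already collected elsewhere, and never calls out to a backend:

```ruby
require "prometheus/client"
require "prometheus/client/formats/text"

# A cheap scrape endpoint: it serializes whatever is already in the
# registry and performs no RPCs of its own, so it stays fast and
# successful even when backends like Gitaly are down.
class MetricsExporter
  def call(_env)
    body = Prometheus::Client::Formats::Text.marshal(Prometheus::Client.registry)
    [200, { "Content-Type" => "text/plain; version=0.0.4" }, [body]]
  end
end
```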
To make matters worse, the Prometheus endpoint was also raising a Ruby runtime error, `NameError: uninitialized constant GRPC::GRPC`, during the outage. This should never happen.
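One plausible mechanism (an assumption, not confirmed from the source): Ruby resolves the constant in a `rescue` clause lazily, only when an exception actually reaches it, so a missing or partially loaded `grpc` library surfaces as a `NameError` only once Gitaly calls start failing:

```ruby
# Minimal reproduction of the failure mode: the constant in a rescue
# clause is resolved only at the moment an exception propagates to it,
# so the NameError appears exactly when the backend starts erroring.
begin
  begin
    raise "gitaly call failed"   # stands in for a failing Gitaly RPC
  rescue GRPC::Unavailable       # constant looked up only now
    puts "handled gRPC error"
  end
rescue NameError => e
  puts e.message # => "uninitialized constant GRPC" (grpc never loaded here)
end
```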
Details of the errors can be found in ELK: https://log.gitlab.net/goto/d074ef822b018ca6d3a3595e4f6c99fb
The `NameError` is also reported in Sentry: https://sentry.gitlap.com/gitlab/gitlabcom/issues/250932/
The stack trace is here:

```
gitaly/server.rb in rescue in info at line 53
gitaly/server.rb in info at line 51
gitaly/server.rb in storage_status at line 46
gitaly/server.rb in readable? at line 30
gitaly/server.rb in read_writeable? at line 26
gitlab/health_checks/gitaly_check.rb in block (2 levels) in metrics at line 17
gitlab/health_checks/base_abstract_check.rb in with_timing at line 32
gitlab/health_checks/gitaly_check.rb in block in metrics at line 17
gitlab/health_checks/gitaly_check.rb in each at line 16
gitlab/health_checks/gitaly_check.rb in flat_map at line 16
gitlab/health_checks/gitaly_check.rb in metrics at line 16
metrics_service.rb in each at line 18
metrics_service.rb in flat_map at line 18
metrics_service.rb in health_metrics_text at line 18
metrics_service.rb in metrics_text at line 24
metrics_controller.rb in index at line 8
action_controller/metal/implicit_render.rb in send_action at line 4
```
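Reading the trace bottom-up, the `/-/metrics` request itself appears to iterate every configured Gitaly storage and make a blocking, timed call for each. A runnable paraphrase of that call chain (stubbed and simplified; not GitLab's actual source):

```ruby
require "benchmark"

# Stand-in for a Gitaly server handle; readable? represents the blocking
# gRPC ServerInfo call made per storage in the real code.
GitalyServer = Struct.new(:storage) do
  def readable?
    true # a live RPC in the real code; stubbed here
  end
end

# Mirrors the with_timing wrapper from base_abstract_check.rb in the trace.
def with_timing
  result = nil
  elapsed = Benchmark.realtime { result = yield }
  [result, elapsed]
end

storages = %w[default storage-1 storage-2] # illustrative storage names

# The shape of gitaly_check.rb#metrics implied by the trace: one timed,
# blocking check per storage, executed inside the scrape request.
metrics = storages.flat_map { |name| with_timing { GitalyServer.new(name).readable? } }
p metrics
```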
This seems to indicate that the scrape endpoint performs health checks of its backend services inline, meaning that it will only return successfully when all 20 Gitaly servers are functioning correctly.
As the number of Gitaly servers grows into the hundreds and beyond, the probability of this health check succeeding tends to zero, meaning that the Prometheus scrape endpoint will effectively never return successfully.
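To make the scaling argument concrete: if each Gitaly server independently responds within the timeout with probability p during a scrape window, a scrape that requires all n servers succeeds with probability p**n (the independence assumption and the value of p are illustrative):

```ruby
# P(scrape succeeds) = p ** n when the endpoint requires all n Gitaly
# servers to answer within the timeout (assumes independent failures).
p_healthy = 0.999
[20, 100, 500].each do |n|
  printf("n=%3d  P(success) = %.3f\n", n, p_healthy**n)
end
# n= 20  P(success) = 0.980
# n=100  P(success) = 0.905
# n=500  P(success) = 0.606
```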
If we want to publish health-check results in Prometheus, this should be done from a separate background thread, not inside the Prometheus scrape endpoint.
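A minimal sketch of that separation, using the `prometheus-client` gem (the `probe_gitaly` helper, storage names, and 15-second interval are illustrative assumptions): a background thread performs the Gitaly probes on its own schedule and records the result in a gauge, which the scrape endpoint then serves for free:

```ruby
require "prometheus/client"

registry = Prometheus::Client.registry
gitaly_up = registry.gauge(
  :gitaly_up,
  docstring: "1 if the Gitaly storage answered the last background probe",
  labels: [:storage]
)

storages = %w[default storage-1] # illustrative storage names

# All health checking happens here, decoupled from the request cycle;
# a slow or dead Gitaly server delays this loop, never the scrape.
Thread.new do
  loop do
    storages.each do |name|
      healthy =
        begin
          probe_gitaly(name) # hypothetical helper wrapping the gRPC call
        rescue StandardError
          false
        end
      gitaly_up.set(healthy ? 1 : 0, labels: { storage: name })
    end
    sleep 15
  end
end
```

The scrape handler then only has to serialize the registry, as in the earlier sketch, and stays fast and successful no matter what state Gitaly is in.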