Prometheus metrics caused "GitLab: API is not accessible" errors

Here's what I think happened:

  1. We enabled Prometheus metrics and HUP'ed unicorn workers.
  2. Three Prometheus servers scrape the endpoint several times per minute.
  3. The metrics endpoint probes all the NFS servers and ends up executing lots of processes (500+!): https://gitlab.com/gitlab-org/gitlab-ce/issues/39730
  4. Somehow this caused git-06 to have stale NFS handles to nfs-04 (https://sentry.gitlap.com/gitlab/gitlabcom/issues/108280/activity/#event_6114)
  5. We also saw some metrics corruption (https://gitlab.com/gitlab-org/gitlab-ce/issues/39728)

We performed the following actions:

  1. Deactivated Prometheus metrics and HUP'ed unicorn
  2. Validated that nfs-04 was indeed the problem by calling Repository#lookup('master') on the repository from https://sentry.gitlap.com/gitlab/gitlabcom/issues/108280/activity/#event_6114
  3. Unmounted nfs-04 and remounted it on git-06
  4. Tested again, and the lookup succeeded
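
For future incidents, a quicker way to check a mount for stale handles might look like the sketch below. This is not what we ran during the incident and not GitLab code; `classify_path` is a hypothetical helper, and the path in the usage note is a placeholder. It relies only on `os.stat` raising `OSError` with `errno.ESTALE` when the kernel reports a stale NFS file handle.

```python
import errno
import os


def classify_path(path, stat=os.stat):
    """Return 'ok', 'stale', 'missing', or 'error:<NAME>' for a path.

    `stat` is injectable so the function can be exercised without a
    real NFS mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        if exc.errno == errno.ESTALE:
            # Stale NFS file handle: the mount is broken, the data may be fine.
            return "stale"
        if exc.errno == errno.ENOENT:
            # The path really does not exist.
            return "missing"
        # Anything else (EACCES, EIO, ...) is reported by its symbolic name.
        return "error:" + errno.errorcode.get(exc.errno, "UNKNOWN")
    return "ok"
```

Usage would be something like `classify_path("/path/to/mount/some/repo.git")` on the affected host. The point is that "stale" and "missing" come back as different answers, which is the distinction we did not get from "NoRepository".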

I think we need to do a number of things:

  1. Reduce the execve and NFS load (see https://gitlab.com/gitlab-org/gitlab-ce/issues/39730, https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15158)
  2. Add better logging/alerts in CE that report ESTALE instead of "NoRepository", at least until we've migrated to Gitaly here
  3. Investigate the metrics corruption in https://gitlab.com/gitlab-org/gitlab-ce/issues/39728
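
For item 1, one possible direction (a rough sketch only, not what the linked merge request does) is to cache the storage probe result for a short time, so that several scrapers hitting the endpoint within the same window trigger a single probe. `CachedProbe`, the `probe` callable, and the TTL value are all hypothetical names and numbers chosen for illustration.

```python
import threading
import time


class CachedProbe:
    """Cache the result of an expensive probe for ttl_seconds."""

    def __init__(self, probe, ttl_seconds, clock=time.monotonic):
        self._probe = probe          # callable that performs the real check
        self._ttl = ttl_seconds
        self._clock = clock          # injectable for testing
        self._lock = threading.Lock()
        self._value = None
        self._expires_at = None      # None means "never probed yet"

    def get(self):
        # The lock is held while probing, so concurrent callers wait for
        # the in-flight probe instead of starting their own.
        with self._lock:
            now = self._clock()
            if self._expires_at is None or now >= self._expires_at:
                self._value = self._probe()
                self._expires_at = now + self._ttl
            return self._value
```

Example: `CachedProbe(check_all_storages, ttl_seconds=30).get()`, where `check_all_storages` stands in for whatever does the real NFS checks. Note that this cache lives in one process, so with multiple unicorn workers each worker would still probe once per TTL window; it reduces the load but does not remove it.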

What else?

Edited by Pablo Carranza [GitLab]