Prometheus metrics caused "GitLab: API is not accessible" errors
Here's what I think happened:
- We enabled Prometheus metrics and HUP'ed unicorn workers.
- Three Prometheus servers scrape the endpoint several times per minute.
- The metrics endpoint probes all the NFS servers and ends up executing lots of processes (500+!): https://gitlab.com/gitlab-org/gitlab-ce/issues/39730
- Somehow this caused git-06 to have stale NFS handles to nfs-04 (https://sentry.gitlap.com/gitlab/gitlabcom/issues/108280/activity/#event_6114)
- We also saw some metrics corruption (https://gitlab.com/gitlab-org/gitlab-ce/issues/39728)
We performed the following actions:
- Deactivated Prometheus metrics and HUP'ed unicorn
- Validated that nfs-04 was indeed a problem by calling `Repository#lookup('master')` in https://sentry.gitlap.com/gitlab/gitlabcom/issues/108280/activity/#event_6114
- Unmounted nfs-04 and remounted it on git-06
- Tested again, and things worked again
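
For future one-off checks, here is a minimal sketch of how a stale handle could be told apart from other failures by stat'ing a path on the suspect mount. `mount_status` and its `stat:` parameter are hypothetical names, not anything in GitLab. Note that `stat` can block on a hard-mounted NFS share, so this is meant for a manual check rather than something a scraped endpoint should run.

```ruby
# Hypothetical helper (not part of GitLab): classify a path by stat'ing it.
# `stat` is injectable so the stale case can be exercised without a real NFS mount.
def mount_status(path, stat: File.method(:stat))
  stat.call(path)
  :ok
rescue Errno::ESTALE
  :stale    # the kernel reported a stale file handle
rescue Errno::ENOENT
  :missing  # the path does not exist at all
rescue SystemCallError
  :error    # any other errno-style failure
end

# Simulate a stale handle instead of touching a real mount.
puts mount_status('/some/mount', stat: ->(_path) { raise Errno::ESTALE })
```

Keeping `:stale` separate from `:missing` is the point: the two cases need different fixes (remount versus investigate the repository).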
I think we need to do a number of things:
- Reduce the `execve` and NFS load (see https://gitlab.com/gitlab-org/gitlab-ce/issues/39730, https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/15158)
- Add better logging/alerts in CE to report `ESTALE` instead of "NoRepository", at least until we've migrated to Gitaly here
- Investigate the metrics corruption in https://gitlab.com/gitlab-org/gitlab-ce/issues/39728
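
To make the `ESTALE` reporting idea concrete, here is a rough sketch of translating the low-level error into its own exception instead of folding it into "NoRepository". The class names `NoRepository` and `StaleStorage` and the `open_repository` wrapper are assumptions for illustration; the real CE classes and call sites may look different.

```ruby
# Hypothetical error classes; the real GitLab classes may differ.
class NoRepository < StandardError; end
class StaleStorage < StandardError; end

# Hypothetical wrapper around whatever opens the repository on disk.
# The block stands in for the real lookup call.
def open_repository(path)
  yield path
rescue Errno::ESTALE => e
  # Surface the stale handle explicitly so logs and alerts can key on it.
  raise StaleStorage, "stale NFS handle for #{path}: #{e.message}"
rescue SystemCallError, IOError => e
  # Everything else keeps the existing "no repository" behaviour.
  raise NoRepository, "no repository at #{path}: #{e.message}"
end

begin
  open_repository('/some/repo.git') { raise Errno::ESTALE }
rescue StaleStorage => e
  puts "would alert: #{e.message}"
end
```

With a distinct exception class, Sentry would group these events separately from genuine missing-repository errors, which should make a recurrence much quicker to spot.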
What else?
Edited by Pablo Carranza [GitLab]