2019-01-29 PullMirrorsOverdueQueueTooLarge
The name of this alert: PullMirrorsOverdueQueueTooLarge
- The pull mirror queue had flatlined.
- Running through the runbook wasn't entirely helpful
- The health of the infrastructure was ok
- When the queue was manually cleared, it doubled in size and then flatlined again
- All pull mirror queues were busy at the time
- The root cause appears to have been a single customer who had 197 mirrors set up, all of which were failing.
- In this case, the timeout before each failure was so long that it was hard to determine whether Sidekiq itself was at fault or whether there was another problem
- To resolve this, we found the offender, disabled their mirrors, and restarted Sidekiq.
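Finding the offender came down to noticing that one owner accounted for a large number of failing mirrors. A minimal sketch of that grouping step, assuming mirror state has already been exported into simple records: the `MirrorRecord` shape, the `top_failing_owners` helper, and the `min_failures` threshold are all hypothetical and are not part of GitLab's schema or API.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple


class MirrorRecord(NamedTuple):
    # Hypothetical export format, for illustration only.
    owner: str      # namespace or customer that owns the mirror
    project: str    # project path of the mirror
    failing: bool   # True if the most recent update attempt failed


def top_failing_owners(
    mirrors: Iterable[MirrorRecord], min_failures: int = 50
) -> List[Tuple[str, int]]:
    """Return (owner, failing_count) pairs, largest first,
    keeping only owners with at least `min_failures` failing mirrors."""
    counts = Counter(m.owner for m in mirrors if m.failing)
    return [(owner, n) for owner, n in counts.most_common() if n >= min_failures]
```

An owner with 197 failing mirrors, as in this incident, would sit at the top of that list, while owners with a handful of failures would be filtered out by the threshold.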
Internal Conversations
- https://gitlab.slack.com/archives/C8HG8D9MY/p1548765276247200
- https://gitlab.slack.com/archives/C4XFU81LG/p1548769496587300
- Unthreaded: https://gitlab.slack.com/archives/C101F3796/p1548762729848300
A Pretty Chart
- The first red line is when the queue was cleared
- The second red line is when we restarted Sidekiq on this particular fleet in the hope of resolving the issue
- The third red line is after we disabled the mirrors and restarted Sidekiq
Sentry Errors
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/615580/
- https://sentry.gitlab.net/gitlab/gitlabcom/issues/615672/
Goals for this Issue
- Discuss how we can improve detection of failure cases like this - it took a REALLY long time and required assistance from a backend engineer
- Discuss if there should be any improvements made to this worker
- Address these discussions as new issues, additions to runbooks, or issues for GitLab
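As a starting point for the detection discussion, one option is to flag a queue that stays large without moving, which is the flatline pattern seen here. This is only a sketch under assumptions: the `is_flatlined` helper and its `min_size`, `tolerance`, and `window` parameters are hypothetical, and the samples are assumed to be queue sizes scraped at a regular interval.

```python
from typing import Sequence


def is_flatlined(
    samples: Sequence[int],
    min_size: int = 1,
    tolerance: int = 0,
    window: int = 10,
) -> bool:
    """Return True if the last `window` samples all stay at or above
    `min_size` and vary by no more than `tolerance`."""
    if len(samples) < window:
        # Not enough history to make a call.
        return False
    recent = samples[-window:]
    return min(recent) >= min_size and max(recent) - min(recent) <= tolerance
```

A check like this would not fire on an empty queue or on a queue that is large but still draining, which distinguishes it from a plain size threshold.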