
2019-01-29 PullMirrorsOverdueQueueTooLarge

The name of this alert: PullMirrorsOverdueQueueTooLarge

  • The pull mirror queue flatlined.
  • Running through the runbook wasn't entirely helpful:
    • The health of the infrastructure was ok
    • When the queue was manually cleared, it doubled in size and then flatlined again
    • All pull mirror queues were busy at the time
  • The root cause appears to have been a customer who had 197 mirrors set up, all of which were failing.
  • In this case, the timeout before each failure was so long that it was hard to determine whether sidekiq itself was at fault or whether there was another problem.
  • To resolve this, we found the offender, disabled their mirrors, and restarted sidekiq.
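
Finding the offender amounts to grouping failing mirrors by owner and looking for an outlier. A minimal sketch of that step, assuming mirror records are already available as dictionaries with hypothetical `owner` and `status` fields (this is not a GitLab API, and the sample data is made up to match the 197 failing mirrors above):

```python
from collections import Counter


def top_failing_owners(mirrors, min_failures=10):
    """Return (owner, failure_count) pairs for owners with at least
    `min_failures` failing mirrors, highest count first."""
    failures = Counter(m["owner"] for m in mirrors if m["status"] == "failed")
    return [(owner, n) for owner, n in failures.most_common() if n >= min_failures]


# Hypothetical sample data: one owner with 197 failing mirrors.
mirrors = (
    [{"owner": "customer-a", "status": "failed"} for _ in range(197)]
    + [{"owner": "customer-b", "status": "failed"} for _ in range(2)]
    + [{"owner": "customer-c", "status": "finished"} for _ in range(50)]
)

print(top_failing_owners(mirrors))
```

The `min_failures` threshold is an arbitrary choice here. It only filters out owners with a handful of ordinary failures.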

Internal Conversations

A Pretty Chart

image

  • The first red line is when the queue was cleared
  • The second red line is when we restarted sidekiq on this particular fleet in the hope that it would resolve the problem
  • The third red line is after we disabled the mirrors and restarted sidekiq

Sentry Errors

Goals for this Issue

  • Discuss how we can improve detection of failure cases like this - it took a REALLY long time and required assistance from a backend engineer
  • Discuss if there should be any improvements made to this worker
  • Address these discussions as new issues, additions to runbooks, or issues for GitLab
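
As a starting point for the detection discussion, one possible check is to flag a queue whose size stays non-zero and unchanged across several consecutive samples. This is a rough sketch only: the function name, the sample window, and the thresholds are assumptions, not part of any existing alert.

```python
def is_flatlined(samples, window=5, floor=1, tolerance=0):
    """Return True when the last `window` queue-size samples are all at
    least `floor` and differ from each other by at most `tolerance`."""
    if len(samples) < window:
        return False  # not enough data to judge
    recent = samples[-window:]
    return min(recent) >= floor and max(recent) - min(recent) <= tolerance


# Hypothetical queue sizes sampled at a fixed interval.
print(is_flatlined([100, 400, 400, 400, 400, 400]))  # stuck at 400
print(is_flatlined([400, 390, 380, 370, 360]))       # draining normally
```

The `floor` argument keeps an empty, idle queue from being reported as stuck.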