Sidekiq health status page: recognize errors

Sidekiq piled up several thousand jobs, and no alarm was raised anywhere.

The health check (queue) reports that everything is fine despite 400,000 jobs in the Sidekiq queue. The only way to see the issue is to look directly at admin/background_jobs.

Who should be interested in this: Administrators

Every monitoring tool and/or administrator that gathers information from the health check endpoints (/-/liveness and /-/readiness). User group: Administrators

Further details

How we got a grip on the issue: gitlab-nginx was throwing random HTTP 500 errors because Sidekiq was too busy to answer. We cleared the particular queue that was overflowing. After that, the HTTP 500 errors dropped to zero.

The error was only visible when reloading the page at random. With 400K jobs, the page would sometimes get stuck loading and return HTTP 500.

Proposal

Make the health check aware of the queue size. The warning and error thresholds should be configurable to take different setups into account.

I think solid starting values would be something around:
Warning if: concurrent jobs * 60 seconds
Error if: concurrent jobs * 120 seconds
Possibly this could also extend the timeline for the background jobs with a "yellow" or "red" range.
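
A minimal sketch of how the proposed thresholds could be evaluated. It assumes "concurrent jobs" means the configured Sidekiq concurrency and that the result is compared against the number of queued jobs; that reading, and all names below (QueueThresholds, classify_queue, the factor parameters), are hypothetical and not part of GitLab or Sidekiq.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueThresholds:
    """Hypothetical configurable thresholds derived from worker concurrency."""

    concurrency: int            # assumed: number of jobs processed in parallel
    warning_factor: int = 60    # proposal: concurrent jobs * 60
    error_factor: int = 120     # proposal: concurrent jobs * 120

    @property
    def warning(self) -> int:
        return self.concurrency * self.warning_factor

    @property
    def error(self) -> int:
        return self.concurrency * self.error_factor


def classify_queue(size: int, thresholds: QueueThresholds) -> str:
    """Map a queue size to "ok", "warning", or "error"."""
    if size >= thresholds.error:
        return "error"
    if size >= thresholds.warning:
        return "warning"
    return "ok"


# Example: with a concurrency of 25, warning starts at 1500 queued jobs
# and error at 3000, so a 400,000-job backlog is flagged as an error.
thresholds = QueueThresholds(concurrency=25)
print(classify_queue(400_000, thresholds))
```

Keeping both factors as parameters is what would let different setups choose their own limits, as the proposal asks.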

Permissions and Security

Same as current background/monitoring with token.

Documentation

The /liveness and /readiness monitoring endpoints will now report a warning when the number of Sidekiq jobs in the queue exceeds a configurable threshold.
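
To illustrate what such a queue-aware response might look like, here is a rough sketch. The JSON field names (status, queue_check), the function name, and the choice of HTTP 503 for the error state are assumptions made for illustration only; they do not describe the current GitLab health check format.

```python
import json
from typing import Dict, Tuple

# Severity order used to pick the worst status across all queues.
_SEVERITY = {"ok": 0, "warning": 1, "error": 2}


def readiness_response(
    queue_sizes: Dict[str, int], warning: int, error: int
) -> Tuple[int, str]:
    """Build a hypothetical (HTTP status, JSON body) pair from queue sizes."""
    checks = []
    worst = "ok"
    for name, size in sorted(queue_sizes.items()):
        if size >= error:
            status = "error"
        elif size >= warning:
            status = "warning"
        else:
            status = "ok"
        checks.append({"queue": name, "size": size, "status": status})
        if _SEVERITY[status] > _SEVERITY[worst]:
            worst = status
    # Assumption: only the error state fails the probe; a warning still
    # returns 200 so that monitoring tools can alert without restarts.
    http_status = 503 if worst == "error" else 200
    return http_status, json.dumps({"status": worst, "queue_check": checks})


code, body = readiness_response(
    {"default": 10, "mailers": 400_000}, warning=1500, error=3000
)
print(code, body)
```

A monitoring tool polling the endpoint could then alert on the top-level status alone and use the per-queue entries to find the overflowing queue.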

Testing

If the HTTP 500 error is not clearly visible in monitoring, jobs and users will sometimes get stuck on an HTTP 500 error page.

What does success look like, and how can we measure that?

Monitoring /liveness and /readiness shows the queue size and warns if a queue grows beyond the configured threshold.

Links / references

GitLab support case 118303 @aciciu

Questions? Tag me.