Sidekiq health status page / recognize errors
Sidekiq piled up several thousand jobs, with no alarm anywhere to be seen
The health check (queue) says everything is fine despite 400,000 jobs sitting in the Sidekiq queue. You MUST look directly at admin/background_jobs to see the issue.
Who should be interested in this: Administrators
Every monitoring tool and/or administrator that gathers information from the health check pages (/-/liveness, /-/readiness). User group: Administrators
Further details
How we got a grip on the issue: gitlab-nginx was throwing random HTTP 500 errors because Sidekiq was too busy to answer. We cleared the particular queue that was overflowing. After that, the HTTP 500 errors dropped to zero.
You could only see the error by reloading the page at random. With 400K jobs, the page "sometimes" gets stuck loading and returns an HTTP 500.
Proposal
Make the health check aware of the queue size. The warning and error thresholds should be configurable to take different setups into account.
I think solid defaults for the thresholds would be something around:
Warning if: concurrent jobs * 60 seconds
Error if: concurrent jobs * 120 seconds
Possibly this could also extend the timeline for the background jobs with a "yellow" or "red" range.
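To make the proposed thresholds concrete, here is a minimal sketch of how they could be evaluated. It is not GitLab code. It assumes the formula means "number of queued jobs compared against concurrency times a factor", and the names QueueThresholds, warn_factor, error_factor and classify_queue are hypothetical, as is the concurrency value of 25.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueThresholds:
    """Hypothetical settings; names and defaults are illustrative only."""

    concurrency: int         # number of jobs Sidekiq can process at once
    warn_factor: int = 60    # "concurrent jobs * 60" from the proposal
    error_factor: int = 120  # "concurrent jobs * 120" from the proposal

    @property
    def warn_at(self) -> int:
        # Queue size at which the health check would report a warning.
        return self.concurrency * self.warn_factor

    @property
    def error_at(self) -> int:
        # Queue size at which the health check would report an error.
        return self.concurrency * self.error_factor


def classify_queue(queue_size: int, thresholds: QueueThresholds) -> str:
    """Return "ok", "warning", or "error" for a given number of queued jobs."""
    if queue_size >= thresholds.error_at:
        return "error"
    if queue_size >= thresholds.warn_at:
        return "warning"
    return "ok"


if __name__ == "__main__":
    # Assumed example value: 25 concurrent jobs gives 1500 / 3000 as limits.
    limits = QueueThresholds(concurrency=25)
    for size in (100, 2_000, 400_000):
        print(size, classify_queue(size, limits))
```

Keeping the factors as settings rather than constants matches the request that both thresholds be configurable per setup. With these example values, the 400,000 jobs from this report would be far above the error limit.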
Permissions and Security
Same as current background/monitoring with token.
Documentation
The /liveness and /readiness monitoring endpoints will now report a warning when the number of Sidekiq jobs in the queue exceeds a configurable threshold.
Testing
If the HTTP 500 error is not clearly visible in the monitoring, jobs and users will sometimes get stuck at an HTTP 500 error page.
What does success look like, and how can we measure that?
The /liveness and /readiness monitoring endpoints show the queue size and warn when a queue exceeds the configured threshold.
Links / references
GitLab support case 118303 @aciciu
Questions? Tag me.