2020-02-12: The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs
Summary
The alert "The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs" fired.
The Sidekiq dashboard showed that the backlog for elastic_indexer was growing out of control.
Timeline
All times UTC.
2020-02-12
- 23:20 - @ggillies noticed the alert here
- 23:32 - @ggillies messaged @DylanGriffith via Slack for assistance, as he had previously worked on the Elasticsearch integration for GitLab: https://gitlab.slack.com/archives/C101F3796/p1581550367019400
2020-02-13
- 00:01 - @ggillies, @DylanGriffith and @changzhengliu joined an incident call to debug
- 01:18 - @ggillies restarted Sidekiq via
gitlab-ctl restart sidekiq-cluster
and the issue went away
Corrective action to follow
- Find all jobs that were dropped and replay them to correct the gap in indexing.
- Understand what should be changed, in application code or elsewhere, to prevent these jobs from being dropped. It appears that all jobs were finishing without error after 10s while doing no work at all, and most did not even raise timeout exceptions.
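
As a starting point for the first action, the sketch below scans structured Sidekiq log lines for jobs that reported success after roughly 10 seconds, which is the symptom described above. It is only an illustration: the worker class name, the JSON field names (class, job_status, duration, jid) and the tolerance window are assumptions and should be checked against the actual log schema before use.

```python
import json
from typing import Iterable, Iterator

# Assumed values, not confirmed by this incident report; adjust as needed.
WORKER_CLASS = "ElasticIndexerWorker"  # hypothetical worker class name
SUSPECT_SECONDS = 10.0                 # jobs appeared to finish after about 10s
TOLERANCE = 0.5                        # hypothetical window around 10s


def suspect_jobs(lines: Iterable[str]) -> Iterator[dict]:
    """Yield log records for jobs that completed without error in about 10s."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if not isinstance(record, dict):
            continue
        if record.get("class") != WORKER_CLASS:
            continue
        if record.get("job_status") != "done":
            continue  # failed jobs are retried by Sidekiq and are not the gap
        duration = record.get("duration")
        if isinstance(duration, (int, float)) and abs(duration - SUSPECT_SECONDS) <= TOLERANCE:
            yield record
```

The matching records (for example their jid and arguments) would form the candidate list to replay. How to replay them depends on the application code and is not sketched here.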
Edited by Dylan Griffith