2020-02-12: The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs

Summary

The alert "The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs" fired. The Sidekiq dashboard showed that the backlog for elastic_indexer was growing unchecked.

https://dashboards.gitlab.net/d/sidekiq-main/sidekiq-overview?orgId=1&from=1581537904303&to=1581559504303&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2

Timeline

All times UTC.

2020-02-12

23:20 - @ggillies noticed the alert here
23:32 - @ggillies messaged @DylanGriffith via Slack for assistance, as he had previously worked on the Elasticsearch integration for GitLab https://gitlab.slack.com/archives/C101F3796/p1581550367019400

2020-02-13

00:01 - @ggillies, @DylanGriffith and @changzhengliu joined an incident call to debug
01:18 - @ggillies restarted sidekiq via gitlab-ctl restart sidekiq-cluster and the issue went away

Corrective action to follow

Find all jobs that were dropped and replay them all to correct the gap in indexing
Understand what should be changed, in application code or elsewhere, to prevent these jobs from being dropped. It appears that all jobs were finishing without error after 10s while doing no work at all; most did not even raise timeout exceptions.
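
One possible starting point for the first action is to scan the structured Sidekiq logs for elastic_indexer jobs that completed "successfully" in almost exactly 10s. The sketch below is only an illustration under stated assumptions: it assumes JSON-formatted log lines with the fields class, job_status, duration (in seconds), jid and args, and a worker class named ElasticIndexerWorker. None of these names are confirmed by this incident, so adjust them to whatever the real logs contain.

```python
import json


def find_suspect_jobs(log_lines, worker="ElasticIndexerWorker",
                      threshold_s=10.0, tolerance_s=0.5):
    """Return jid/args of jobs that finished without error near threshold_s.

    All field names ("class", "job_status", "duration", "jid", "args") and
    the default worker name are assumptions about the log format.
    """
    suspects = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if entry.get("class") != worker:
            continue
        if entry.get("job_status") != "done":
            continue  # only jobs that reported success are of interest
        duration = entry.get("duration")
        if not isinstance(duration, (int, float)):
            continue
        # Jobs that "succeeded" after almost exactly 10s are replay candidates.
        if abs(duration - threshold_s) <= tolerance_s:
            suspects.append({"jid": entry.get("jid"), "args": entry.get("args")})
    return suspects
```

The returned args could then be used to re-enqueue the affected jobs, after confirming that replaying them is safe.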

Edited Feb 13, 2020 by Dylan Griffith