2020-03-15 sidekiq queue project_export saturation
Summary
The sidekiq queue project_export was flooded with jobs much faster than they could be completed, accumulating a backlog of jobs in the queue. The resulting high queue duration triggered a PagerDuty alert that generically described high latency for sidekiq as a whole, although only the project_export queue was affected.
The sidekiq service (main stage) has an apdex score (latency) below SLO
The backlog of work took 15 minutes to complete, after which the saturation was resolved.
This alert is noisy. The apdex calculation underlying this alert frequently drops below its SLO target (although it often also recovers before alerting humans).
As has been commented several times, it may be worthwhile to revisit the goals served by this SLO and its accompanying alert. We certainly want to know when a subsystem is oversubscribed and affecting our users. Ideally we want to report this information in a useful, actionable manner:
- Support rapid analysis of an unexpected increase in demand or loss of capacity or performance.
- Provide input to planning and prioritizing work around elastic capacity adjustments (whether or not automated).
- Tie to runbooks that give clear guidance on suggested diagnostic methods, known bad behaviors, normal vs. abnormal patterns and their associated SLI metrics, suggested responses to known problems, warnings about possible side-effects on other subsystems that share resources that may be heavily used by the alerting service, etc.
Timeline
All times UTC.
2020-03-15
- 16:01 - Many jobs start to be enqueued to the project_export queue. The enqueue rate far exceeds the dequeue rate, so a backlog of work begins accumulating.
- 16:05 - End of enqueuing. Roughly 1000 jobs were enqueued over the preceding 4 minutes, and the backlog is at its peak queue size. Sidekiq's pool of consumers for this queue (4 VMs plus 3 k8s pods) continues working steadily on the backlog for the next 15 minutes (see the rough arithmetic sketch below).
- 16:10 - PagerDuty alert triggers due to overall sidekiq job latency exceeding its SLO.
- 16:19 - End of regression. Queue size reaches 0. At its worst, the time spent waiting in the job queue (json.scheduling_delay_s) was trending around 760 seconds (12.6 minutes).
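As a back-of-the-envelope check on these numbers, here is a minimal sketch of the saturation arithmetic (the rates are approximations derived from the timeline, not measured values):

```python
# Rough saturation arithmetic for the project_export backlog.
# All figures are approximations taken from the timeline above.

enqueued_jobs = 1000        # jobs enqueued between 16:01 and 16:05
enqueue_window_s = 4 * 60   # enqueuing lasted roughly 4 minutes
drain_window_s = 18 * 60    # queue reached 0 at ~16:19, ~18 minutes after 16:01

enqueue_rate = enqueued_jobs / enqueue_window_s  # ~4.2 jobs/s while enqueuing
drain_rate = enqueued_jobs / drain_window_s      # ~0.9 jobs/s sustained by the workers

# Peak backlog: jobs enqueued minus jobs already completed when enqueuing stopped (16:05).
peak_backlog = enqueued_jobs - drain_rate * enqueue_window_s  # ~780 jobs

# Worst-case queue wait: the last job enqueued waits for the whole backlog ahead of it.
max_wait_s = peak_backlog / drain_rate  # ~840 s, the same order as the ~760 s observed

print(f"{enqueue_rate:.1f} jobs/s in, {drain_rate:.1f} jobs/s out, "
      f"peak backlog ~{peak_backlog:.0f} jobs, worst wait ~{max_wait_s:.0f} s")
```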
Resources
Dashboard "sidekiq: Overview"
The sidekiq service as a whole frequently violates its apdex (latency) SLO as currently defined. These regressions are often short enough that they do not trigger a PagerDuty alert, but in this case one did.
Note: The apdex calculation includes the time a job spends waiting in the queue, and we have an unbounded rate of enqueuing new jobs coupled with a bounded backing capacity for executing jobs. When demand exceeds capacity, jobs queue, and this alert notifies us if this saturation does not self-resolve quickly.
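For context, assuming the standard Apdex definition (with each job's latency counted as queue wait plus execution time, per the note above):

\[
\text{Apdex} = \frac{n_{\text{satisfied}} + \tfrac{1}{2}\, n_{\text{tolerating}}}{n_{\text{total}}}
\]

When queue wait dominates, as it did during this backlog, even jobs that execute quickly fall out of the "satisfied" bucket, so the score drops for the service as a whole.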
The length of the project_export queue grew abruptly, as jobs were being enqueued faster than they could be dequeued and completed.
Highlighting that the export priority level often has no jobs to execute, but during the regression it was working steadily.
project_export jobs slower than 10 seconds (excluding queue time)
Kibana search - sidekiq https://log.gprd.gitlab.net/goto/dfb47cf53e0003d985ffda658b7637a8
This shows only the jobs that were slower than 10 seconds after reaching the head of the queue (json.duration). Time spent waiting in the queue is measured separately (json.scheduling_latency_s). Once the queue is clogged with a backlog of work, any new jobs (even quick ones) will have a high queue time (json.scheduling_latency_s).
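A minimal sketch of how these two fields combine into the latency a requester actually experiences (the field names follow the Kibana search above; treating the log record as a flat dict is an assumption for illustration):

```python
# Split a sidekiq job's end-to-end latency into queue time vs. execution time.
# Field names follow the Kibana search above; the flat-dict record shape is assumed.

def job_latency_breakdown(record: dict) -> dict:
    queue_wait = record["json.scheduling_latency_s"]  # time spent waiting in the queue
    execution = record["json.duration"]               # time spent executing once dequeued
    return {
        "queue_wait_s": queue_wait,
        "execution_s": execution,
        "total_perceived_s": queue_wait + execution,  # what the requester experiences
    }

# During the backlog, even a quick job looks slow end-to-end:
example = {"json.scheduling_latency_s": 700.0, "json.duration": 2.5}
print(job_latency_breakdown(example))  # total_perceived_s is dominated by queue wait
```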
88% of the 454 slow jobs belonged to a single specific project namespace. Follow-up confirmed that jobs were being enqueued by this namespace more quickly than they were being processed, hence the backlog of work accumulated.
Filtering to just that namespace and removing the duration filter, we can see that roughly 1000 project_export jobs ran for this project namespace. Many of them executed quickly, but collectively they were being created and enqueued more quickly than they were dequeued and completed.
https://log.gprd.gitlab.net/goto/6799f08a9bed305b3e838be3189f5254
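A minimal sketch of the aggregation behind the "88% of the 454 slow jobs" figure above (records stands in for the exported search results; the namespace field name is an assumption for illustration):

```python
from collections import Counter

# Share of slow jobs per namespace, over records exported from the Kibana search.
# The "json.meta.root_namespace" field name is an assumption for illustration.

def slow_job_share_by_namespace(records, threshold_s=10.0):
    slow = [r for r in records if r["json.duration"] > threshold_s]
    counts = Counter(r["json.meta.root_namespace"] for r in slow)
    total = len(slow)
    return {ns: count / total for ns, count in counts.most_common()}
```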