Elasticsearch index sidekiq jobs are impacting GitLab.com availability metrics
Ignore `priority=throttled` queues when alerting on queue sizes.
This will make sure we don't alert when a lot of jobs are queued on those queues, since that is intentional. I think we'll need to update this PromQL query: https://gitlab.com/gitlab-com/runbooks/-/blob/815cd7d325dd46f7d59bad546b4c3044f1b30553/rules/service_saturation.yml#L560
As such, we allow up to 5m scheduling latency on these jobs.
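To make the idea of ignoring `priority=throttled` in queue-size alerting concrete, here is a minimal Python sketch of the filtering logic. It is not the PromQL rule from the runbooks: the `QueueSample` type, the queue names, the counts, and the threshold are all hypothetical and only illustrate the intended behaviour.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One queue-size observation (hypothetical shape, for illustration only)."""
    queue: str
    priority: str  # assumed to mirror the priority=throttled attribute
    queued_jobs: int


def queues_to_alert_on(samples, threshold):
    """Return the queues whose size exceeds the threshold, skipping throttled ones."""
    return [
        s.queue
        for s in samples
        if s.priority != "throttled" and s.queued_jobs > threshold
    ]


# Hypothetical data: the throttled queue is deliberately backed up.
samples = [
    QueueSample("elastic_indexer", "throttled", 5000),
    QueueSample("mailers", "default", 1200),
    QueueSample("pipeline_processing", "default", 10),
]

print(queues_to_alert_on(samples, threshold=1000))  # ['mailers']
```

The throttled queue holds the most jobs in this sample data, yet it never reaches the alert list, which is the behaviour we would want from the updated query.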
Elasticsearch indexing is deliberately limited to a single worker, forcing jobs to queue up. This mechanism is used to avoid overloading the Elasticsearch service: an artificial bottleneck is introduced in the Sidekiq queue to ensure that we don't saturate Elasticsearch.
This situation is leading to several issues, however:
- The Elasticsearch worker is reporting as saturated at 100% for long periods of time (see alert below)
- Sidekiq SLOs are not being met (the ES index jobs are taking up to 16 minutes to be scheduled, well over the 5m objective)
- Since Sidekiq's SLOs are not being met, and the Sidekiq service is part of the weighted average availability for GitLab.com (see gitlab-com/www-gitlab-com#5968 (closed)), this is having an impact on top-level metrics
Some questions:
- Are we sure that 1 worker is the right throttle volume? Was 2 workers tested and found to be too much volume for ES? cc @mwasilewski-gitlab
- Should we have a third `latency_sensitive` attribute value that indicates that very long wait durations (up to several hours?) are acceptable for these jobs?
  - This would mean that `latency_sensitive` goes from being a boolean to an enumeration, which will be a bit of work, but should be possible (also cc @smcgiven for Sidekiq Queue Selector syntax input)
  - We would also exclude any queues with this attribute from our saturation warnings, as it's likely they will be saturated for long periods of time.
- Should we include more histogram buckets (past the current maximum of 10m) to measure the scheduling latency on these queues?
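
To illustrate the histogram question, here is a small Python sketch of why the current buckets cannot measure the latencies we are seeing. Only the 10m (600s) maximum comes from this issue; the other bucket boundaries, current and extended, are assumptions picked for illustration. Buckets are treated as cumulative upper bounds, the way Prometheus `le` buckets work.

```python
import bisect

# Upper bounds in seconds. Only the 600s maximum is taken from this issue;
# the smaller boundaries are assumed.
CURRENT_BUCKETS = [60, 300, 600]

# Hypothetical extension covering waits of up to several hours.
EXTENDED_BUCKETS = CURRENT_BUCKETS + [1800, 3600, 4 * 3600]


def bucket_for(latency_seconds, upper_bounds):
    """Return the smallest upper bound >= latency, or +Inf if none fits."""
    i = bisect.bisect_left(upper_bounds, latency_seconds)
    return upper_bounds[i] if i < len(upper_bounds) else float("inf")


sixteen_minutes = 16 * 60  # the scheduling latency mentioned above

print(bucket_for(sixteen_minutes, CURRENT_BUCKETS))   # inf
print(bucket_for(sixteen_minutes, EXTENDED_BUCKETS))  # 1800
```

With the current maximum, a 16 minute wait only tells us the job took "more than 10m". With larger buckets we could at least tell a 16 minute wait apart from a multi-hour one.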