Elasticsearch index sidekiq jobs are impacting GitLab.com availability metrics

Ignore the priority=throttled when alerting on queue sizes.

This will make sure we don't alert when there are a lot of queued jobs for those queues, since that is intentional. I think we'll need to update this PromQL query: https://gitlab.com/gitlab-com/runbooks/-/blob/815cd7d325dd46f7d59bad546b4c3044f1b30553/rules/service_saturation.yml#L560
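
As a rough illustration of the intended behavior (not the actual rule in the runbooks repository), the filtering could be sketched as below. The sample structure, the queue names, the `default` priority value, and the threshold are all assumptions made for this example; only the `throttled` priority comes from this issue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueSample:
    """One queue-size observation; the field names are hypothetical."""
    queue: str
    priority: str  # "throttled" is from the issue; other values are assumed
    size: int


def queues_to_alert_on(samples, max_size, ignored_priorities=("throttled",)):
    """Return names of queues larger than max_size, skipping ignored priorities."""
    return [
        s.queue
        for s in samples
        if s.priority not in ignored_priorities and s.size > max_size
    ]


# Invented sample data: the throttled queue is intentionally backed up.
samples = [
    QueueSample("elastic_indexer", "throttled", 12_000),
    QueueSample("mailers", "default", 40),
    QueueSample("pipeline_processing", "default", 5_500),
]

alerting = queues_to_alert_on(samples, max_size=1_000)
print(alerting)
```

In PromQL terms, and assuming the queue-size metric carries a `priority` label, this would probably amount to adding a negative label matcher such as `priority!="throttled"` to the selector in the linked rule.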

Elasticsearch indexer jobs are classified as non-latency-sensitive sidekiq jobs.

As such, we allow up to 5m scheduling latency on these jobs.

Elasticsearch indexing is deliberately limited to a single worker, forcing jobs to queue up. This mechanism is used to avoid overloading the Elasticsearch service: an artificial bottleneck is introduced in the Sidekiq queue to ensure that we don't experience Elasticsearch saturation.

This situation is leading to several issues, however:

The Elasticsearch worker is reporting as saturated at 100% for longer periods of time (see alert below)
Sidekiq SLOs are not being met (the ES index jobs are taking up to 16 minutes to be scheduled, well over the 5m objective)
Since Sidekiq's SLOs are not being met, and the Sidekiq service is part of the weighted average availability for GitLab.com (see gitlab-com/www-gitlab-com#5968 (closed)), this is having an impact on top-level metrics
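
To make the last point concrete, here is a minimal sketch of how a single service missing its SLO pulls down a weighted average. The service names, weights, and availability figures are invented for illustration and are not the real GitLab.com values.

```python
def weighted_availability(services):
    """Compute a weighted average from a mapping of name -> (availability, weight)."""
    total_weight = sum(weight for _, weight in services.values())
    return sum(avail * weight for avail, weight in services.values()) / total_weight


# All numbers below are made up for this example.
healthy = {"web": (0.9995, 5), "api": (0.9995, 5), "sidekiq": (0.9995, 1)}

# Same services, but Sidekiq is missing its scheduling-latency objective.
degraded = {**healthy, "sidekiq": (0.9500, 1)}

print(f"all services healthy: {weighted_availability(healthy):.4%}")
print(f"sidekiq degraded:     {weighted_availability(degraded):.4%}")
```

Even with a small weight, the degraded Sidekiq figure lowers the combined number, which is the effect described in the list above.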

Some questions:

Are we sure that 1 worker is the right throttle volume? Were 2 workers tested and found to be too much volume for ES? cc @mwasilewski-gitlab
Should we have a third latency_sensitive attribute value that indicates that very long wait durations (up to several hours?) are acceptable for these jobs?
1. This would mean that latency_sensitive goes from being a boolean to an enumeration, which will be a bit of work but should be possible (also cc @smcgiven for Sidekiq Queue Selector syntax input)
2. We would also exclude any queues with this attribute from our saturation warnings as it's likely they will be saturated for long periods of time.
Should we include more histogram buckets (past the current maximum of 10m) to measure the scheduling latency on these queues?
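
The enumeration idea and the histogram question can be sketched together as follows. This is only an illustration: the enum and function names, the `LATENCY_SENSITIVE` and `THROTTLED` objectives, and the individual bucket bounds are assumptions. Only the 5m objective, the 10m maximum bucket, and the 16-minute delay come from this issue.

```python
from enum import Enum


class LatencySensitivity(Enum):
    """Hypothetical three-valued replacement for the boolean latency_sensitive."""
    LATENCY_SENSITIVE = "latency_sensitive"
    DEFAULT = "default"
    THROTTLED = "throttled"


# Scheduling-latency objective per value, in seconds.
SCHEDULING_OBJECTIVE_SECONDS = {
    LatencySensitivity.LATENCY_SENSITIVE: 10,   # placeholder, not from the issue
    LatencySensitivity.DEFAULT: 5 * 60,         # the 5m objective from the issue
    LatencySensitivity.THROTTLED: 4 * 60 * 60,  # placeholder for "several hours"
}


def meets_objective(sensitivity, scheduling_latency_seconds):
    """True if the observed scheduling latency is within the objective."""
    return scheduling_latency_seconds <= SCHEDULING_OBJECTIVE_SECONDS[sensitivity]


def included_in_saturation_alerts(sensitivity):
    """Queues with the new value are expected to be saturated, so skip them."""
    return sensitivity is not LatencySensitivity.THROTTLED


def largest_measurable_latency(bucket_bounds_seconds):
    """Latencies above the largest finite bucket bound cannot be told apart."""
    return max(bucket_bounds_seconds)


observed = 16 * 60  # the 16-minute scheduling delay reported in this issue

# Assumed bounds; only the 10m (600s) maximum is taken from the issue.
current_buckets = [1, 10, 60, 300, 600]

print(meets_objective(LatencySensitivity.DEFAULT, observed))
print(meets_objective(LatencySensitivity.THROTTLED, observed))
print(included_in_saturation_alerts(LatencySensitivity.THROTTLED))
print(observed > largest_measurable_latency(current_buckets))
```

Under these assumptions, the 16-minute delay violates the 5m objective but would be acceptable for the new value, and it already exceeds what a histogram capped at 10m can measure.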

Edited Mar 31, 2020 by Bob Van Landuyt