Draft: Adds SidekiqSampler to collect time-based Sidekiq metrics

Matthias Käppler requested to merge 290768-add-sidekiq-sampler into master

What does this MR do?

Related to #290768

What

This MR adds two Sidekiq Prometheus metrics, which currently only exist in gitlab-exporter, to the application itself. We are looking to retire gitlab-exporter entirely at some point, but cannot do so until all metrics we rely on are made available elsewhere.

The two metrics I migrated:

  • sidekiq_queue_size: The number of jobs awaiting execution in a given queue. It is a gauge.
  • sidekiq_queue_latency_seconds: The time the oldest job in a given queue has been waiting, i.e. a measure of how quickly we are processing jobs in that queue.

I also added:

  • sidekiq_sampler_duration_seconds_total: A counter accruing overall sampling time. It will tell us how much time we spend just sampling this data, e.g. as a per-second rate. (A sketch of how these metrics could be declared follows below.)
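
For illustration only, here is roughly what declaring these metrics could look like with the plain prometheus-client gem; the actual MR presumably goes through GitLab's own metrics plumbing, and the `:name` label is just an assumption for this sketch.

```ruby
# Sketch only: declaring the three metrics above with the plain
# prometheus-client gem. The :name label is an assumption for illustration.
require 'prometheus/client'

registry = Prometheus::Client.registry

queue_size = Prometheus::Client::Gauge.new(
  :sidekiq_queue_size,
  docstring: 'Number of jobs waiting in a given Sidekiq queue',
  labels: [:name]
)

queue_latency = Prometheus::Client::Gauge.new(
  :sidekiq_queue_latency_seconds,
  docstring: 'Time the oldest job in a given Sidekiq queue has been waiting',
  labels: [:name]
)

sampler_duration = Prometheus::Client::Counter.new(
  :sidekiq_sampler_duration_seconds_total,
  docstring: 'Total time spent sampling Sidekiq queue metrics'
)

[queue_size, queue_latency, sampler_duration].each { |metric| registry.register(metric) }

# Example updates from one sampling run:
queue_size.set(42, labels: { name: 'default' })
queue_latency.set(1.5, labels: { name: 'default' })
sampler_duration.increment(by: 0.003)
```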

Why

For some context, it might be good to understand why these metrics currently sit in gitlab-exporter and not in the application itself. We do already export Sidekiq job metrics from a metrics middleware. However, a middleware only fires when a new job is being enqueued, whereas metrics such as queue latency are functions of time, not of queue activity. In gitlab-exporter we solved this by regularly probing directly into Redis to inspect the state of all queues; this happens in a separate process, entirely outside of any application behavior.

How

I followed the existing pattern, which is to introduce a SidekiqSampler daemon thread (akin to PumaSampler). These threads collect various in-app metrics at a defined frequency, which are then scraped from an exporter endpoint. This approach has benefits and drawbacks: a clear benefit is full access to internal application state. The drawback is that the sampler shares resources such as CPU, memory and network with the application itself, so it needs to be very efficient in order not to create noticeable drag on application performance.
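
To illustrate the general shape of such a sampler (heavily simplified; the real implementation builds on the existing sampler/daemon infrastructure rather than a bare Thread, and the names here are made up):

```ruby
# Heavily simplified sketch of the daemon-thread sampler pattern; class and
# method names are illustrative, not the actual implementation.
class SidekiqSamplerSketch
  def initialize(interval: 10)
    @interval = interval
  end

  def start
    @thread = Thread.new do
      loop do
        sample                 # query Redis and update the gauges/counter
        sleep(sleep_interval)  # jittered interval, see the next section
      end
    end
  end

  private

  def sample
    # collect queue sizes and latencies, update Prometheus metrics
  end

  def sleep_interval
    @interval # jitter is applied in practice, see below
  end
end
```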

Another problem with this approach, and specifically with the new SidekiqSampler, is that it might perform more work than necessary. If we have n Sidekiq instances, there is no actual need to ask for system-wide data such as queue lengths from more than one of them. This was not a problem with gitlab-exporter, because it only exists once. However, I believe we can achieve a similar effect by reducing the average sampling frequency and adding enough jitter, so that the load is spread broadly enough to avoid over-sampling.

As for execution speed and sample frequency, there are a couple things to unpack 👇

Sample frequency and Redis load

This sampler will run on every Sidekiq worker. Using this query as a proxy, there are no more than 500 workers pulling samples at any given time.

We furthermore sample at a given frequency with jitter applied, so that for an interval of n seconds a sampler sleeps a random duration (in 100ms steps) between n - n/2 and n + n/2 seconds between runs. So with an interval of 10 seconds, we sleep at least 5 seconds between runs, and at most 15.
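
Expressed as a sketch (assuming the 100ms-step behavior described above; the helper name is made up):

```ruby
# Jittered sleep: uniformly distributed in 100ms steps between n/2 and 1.5n.
# With interval = 10 this yields a value between 5.0 and 15.0 seconds.
def jittered_interval(interval)
  steps = (interval * 10).to_i            # number of 100ms steps in one interval
  (interval / 2.0) + rand(0..steps) * 0.1
end

jittered_interval(10) # => e.g. 7.3
```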

With 500 instances, if they are randomly distributed over this 10 second time window, we would sample 50 times a second on average. The main consideration here is the increase in load to Redis: With each sample we send 1 Redis query for a total of 50 QPS (see also the next section).

However, the number of commands Redis has to execute will be much higher, since it needs to evaluate lrange for all queues. Assuming 385 queues (see below), this would be 50 * 385 = 19250 commands/sec. That sounds like a lot, since it's roughly 2/3 of the ~30k commands per second we currently serve through the Sidekiq Redis cluster: https://dashboards.gitlab.net/d/redis-sidekiq-main/redis-sidekiq-overview?viewPanel=69320&orgId=1

As for connections, the worst case is that, if all workers happened to sample at the same time, we could see an extra num_workers ~ 500 global connections.

Redis requests

Queue size and latency must be queried from Redis. I decided not to use the built-in Sidekiq:: helpers, because they do not collect queue latencies efficiently: doing so performs N+1 queries (fetch all queues, then compute the latency of each queue). The queue lengths function is much more efficient, since it uses a pipelined query, so there are only 2 roundtrips: 1 to fetch all queues, and 1 pipelined query to collect the N lengths via scatter-gather.
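
For reference, the N+1 pattern referred to here looks like this when using the public Sidekiq API (not code from this MR, just an illustration):

```ruby
# N+1: one call to enumerate the queues, then one Redis roundtrip per queue
# to read its oldest job and compute the latency.
require 'sidekiq/api'

latencies = Sidekiq::Queue.all.to_h { |queue| [queue.name, queue.latency] }
```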

However, I reckoned that if I push all the metrics I need into a single pipelined query, I can get away with just 1 query per sample (amortized): 1 query to fetch the queue names, whose result I cache for the lifetime of the worker so it stays constant over time, plus 1 pipelined query per sample to fetch all N+M data points (lengths and latencies) across all queues. We could push this even further by doing crazy stuff like executing scripts inside Redis, but I think this is good enough.
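
Roughly, a single sample then looks like the following sketch, assuming the standard Sidekiq key layout (a `queues` set plus one `queue:<name>` list per queue) and using redis-rb directly; names and structure are illustrative rather than the literal MR code:

```ruby
require 'redis'
require 'json'

redis = Redis.new # the Sidekiq Redis instance

# Fetched once and cached for the lifetime of the worker in practice.
queue_names = redis.smembers('queues')

# One pipelined roundtrip per sample: queue length plus oldest job, per queue.
replies = redis.pipelined do |pipeline|
  queue_names.each do |name|
    pipeline.llen("queue:#{name}")
    pipeline.lrange("queue:#{name}", -1, -1) # oldest job sits at the tail
  end
end

now = Time.now.to_f
queue_names.each_with_index do |name, i|
  size       = replies[2 * i]
  oldest_job = replies[2 * i + 1].first
  # enqueued_at is an epoch timestamp in the job payload
  latency    = oldest_job ? now - JSON.parse(oldest_job)['enqueued_at'].to_f : 0.0
  # update sidekiq_queue_size and sidekiq_queue_latency_seconds here
end
```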

Time spent on CPU

One issue with computing queue latency is that we need to inspect the raw job data to compute a time delta. This means we need to parse the job JSON to find its enqueued_at field.

My worst case estimate for this is:

num_queues * parse_json_time

The number of jobs we need to inspect is capped by num_queues, because we only need to peek at the first element of each queue, i.e. the oldest job. So assuming every queue has jobs waiting, we need to look at num_queues jobs.

As for actual numbers: I wasn't sure how best to determine the exact number of Sidekiq queues we have, but my local Redis claims it is 385. I also repeatedly timed parsing the job JSON of a Chaos::CpuSpinWorker job, and it quite consistently takes around 30us on my 16-core Core i9, with the frequency capped to 2.4GHz.
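
The measurement can be reproduced with something along these lines (a rough sketch; the payload path is a placeholder):

```ruby
# Rough reproduction of the parse-time measurement: average time to JSON-parse
# a captured Sidekiq job payload. The payload path is a placeholder.
require 'benchmark'
require 'json'

job_json   = File.read('tmp/chaos_cpu_spin_worker_job.json')
iterations = 10_000

total = Benchmark.realtime { iterations.times { JSON.parse(job_json) } }
puts "avg: #{(total / iterations * 1_000_000).round(1)}us per parse"
```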

So assuming those values are somewhat representative, we would spend around 385 * 30us ≈ 11.5ms per sample parsing job JSON. However, we aim for empty queues, not full queues, so this is the worst case, e.g. when there is an outage.

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing
